问题如下:
|使用Python或者Perl编程,完成以下任务:
- 1.tab是Trichoderma reesei QM6a的annotation文件,2.obo是GO的ontology文件,根据ontology文件中所记录的GO关系:
(a) 获得已注释的所有基因属于哪一个GO的全部列表
(b) 计算Trichoderma reesei QM6a中每一个GO分别有多少个基因。
- DEG.txt中记录了某一次转录组分析的差异基因结果。请计算:
(a) 每一个GO中所属的基因显著上调和下调的个数有多少?
(b) 哪些GO发生了显著的上调或者下调的富集?
文件解释:
先说一下文件,1.tab是如下图所示的annotation文件,大概两万多行,有proteinId,gotermId,goName,gotermType,goAcc的信息,其中本次用到的的是proteinId和goAcc。goAcc是该基因所属的GO号,形式是GO:加七位数字。一个基因能对应多个GO id,一个GO id也能对应多个基因。
proteinId gotermId goName gotermType goAcc
1673 3504 translational initiation biological_process GO:0006413
1673 2855 cytoplasm cellular_component GO:0005737
1673 1108 translation initiation factor activity molecular_function GO:0003743
1673 1096 RNA binding molecular_function GO:0003723
1692 3503 protein biosynthesis biological_process GO:0006412
1692 2953 ribosome cellular_component GO:0005840
1692 2746 intracellular cellular_component GO:0005622
1692 1107 structural constituent of ribosome molecular_function GO:0003735
1692 1096 RNA binding molecular_function GO:0003723
1702 3444 transcription biological_process GO:0006350
1702 1233 DNA-directed DNA polymerase activity molecular_function GO:0003887
1702 1056 DNA binding molecular_function GO:0003677
1702 10782 sigma DNA polymerase activity molecular_function GO:0019984
1702 8197 theta DNA polymerase activity molecular_function GO:0016452
1702 8196 nu DNA polymerase activity molecular_function GO:0016451
1702 8195 kappa DNA polymerase activity molecular_function GO:0016450
1702 8194 lambda DNA polymerase activity molecular_function GO:0016449
1702 8193 mu DNA polymerase activity molecular_function
2.obo记录了GO之间的上下级关系,文件本身三十几万行,47125个GO,如下所示的[Term]表示一个GO,随后id:GO:XXXXXXX就是这个GO的 id,namespace就是这个GO属于本体论里三个顶级GO的哪一个,比如GO:0000001属于biological_process。
第一题的(a)就是要找到所有GO的所有上级GO(或者说父亲),最终归宿到biological_process、molecular_function、cellular_component。
format-version: 1.2
data-version: releases/2018-04-12
subsetdef: goantislim_grouping "Grouping classes that can be excluded"
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_goa "GOA and proteome slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_virus "Viral GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: gosubset_prok "Prokaryotic GO subset"
subsetdef: mf_needs_review "Catalytic activity terms in need of attention"
subsetdef: termgenie_unvetted "Terms created by TermGenie that do not follow a template and require additional vetting by editors"
subsetdef: virus_checked "Viral overhaul terms"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
remark: cvs version: use data-version
remark: Includes Ontology(OntologyID(Anonymous-35)) [Axioms: 230 Logical Axioms: 228]
remark: Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]
ontology: go
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization
[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: gosubset_prok
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process
[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
namespace: molecular_function
def: "OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]
comment: This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.
synonym: "ribosomal chaperone activity" EXACT []
is_obsolete: true
consider: GO:0042254
consider: GO:0044183
consider: GO:0051082
......
......
这最后的is_a: GO:XXXXXXX! 意思是本GO是后面这个GO的一部分,是它的下级,relationship: part_of GO:XXXXXXX!是同样的意思。
有些GO和其他GO什么关系都没有,最后有个is_obsolete: true,说明它过时了,同时还会在def:最前面加上 OBSOLETE 。
还有一些其他的关系,在这里暂时无须明白里面的所有内容,可以去看看一文极速读懂 Gene Ontology,老师说找relationship: part_of、is_a:建关系即可。
#DEG.txt中记录了某一次转录组分析的差异基因结果
gene_id A B p-value FDR significance
105313 476.125 2.01685 0 0 DOWN
81082 188.012 0.388781 4.44E-16 0.000106687 DOWN
70204 30.8683 0.0378778 3.33E-15 0.000106687 DOWN
112031 74.6035 0.708632 4.66E-15 0.000106687 DOWN
122511 1065.56 8.16165 1.44E-14 0.000106687 DOWN
49274 106.879 1.05212 1.44E-14 0.000106687 DOWN
68803 186.04 2.46745 1.69E-14 0.000106687 DOWN
123914 149.261 1.65829 3.82E-14 0.000106687 DOWN
59362 56.5321 0.138204 4.22E-14 0.000106687 DOWN
105882 195.885 0.663057 5.64E-14 0.000106687 DOWN
108676 703.881 2.29893 1.20E-13 0.000106687 DOWN
60489 1.33266 157.422 4.06E-13 0.000106687 UP
66935 2.52345 123.558 2.10E-12 0.000106687 UP
76659 214.616 1.82799 3.86E-12 0.000106687 DOWN
解决问题
获得已注释的所有基因属于哪一个GO的全部列表
#引入包
import os
import numpy as np
import pandas as pd
import csv
import re
import time
#一开始想减少耗时,把2.obo中用不到的,即除is_a:|relationship: part_of|id: 的行全删了。
with open('2.obo','r') as f:
while True:
line = f.readline()
strline = str(line)
h = re.match('is_a: GO|relationship: part_of|id: GO',strline)
if h is not None :
clean = open('2_clean.obo','a')
clean.write(line)
clean.close()
if not line:
break
f.close()
#字典法
def dictGO():
father_list =[]
with open('2_clean.obo','r') as f:
while True:
line = f.readline()
strline = str(line)
m = re.match('id: GO',line)
if m is not None:
if len(father_list) != 0:
dict_[term] = father_list
father_list = []
term = re.search('GO:\d{7}',line).group()
else:
h = re.match('is_a: GO|relationship: part_of',strline)
if h is not None:
father = re.search('GO:\d{7}',line).group()
father_list.append(father)
if not line:
break
f.close()
#迭代找爹
def finder(a):
global num
father = dict_.get(a)
if father:
father_list.extend(father)
num = num + 1
for i in father:
finder(i)
else:
num_list.append(num)
num = 0
#构建字典
dict_ ={}
dictGO()
tab = pd.read_csv('1.csv') #读入1.obo数据,可以将之换成csv读入,反正能读成数据框就行
with open("dict.csv","w",newline='') as csvfile:
writer = csv.writer(csvfile, delimiter=' ')
writer.writerow(["GO_id,","father_num,","father_generation,","father_id"]) #"GO_id,"结尾的逗号作为csv的分隔符
for j in range(0,len(tab['goAcc'])-1):
a = tab['goAcc'][j]
father_list = []
num_list = []
num = 0
finder(a)
father_list = sorted(set(father_list),key=father_list.index)#列表去重复
nummax = max(num_list) #这里求了父节点有几代,因为计算方法是深度优先,取了最大值
writer.writerow([tab['goAcc'][j],",",len(father_list),",",nummax,",",father_list])
csvfile.close()
计算Trichoderma reesei QM6a中每一个GO分别有多少个基因
方便起见,稍微改一改,另外建一个记录下级节点的字典。
gocleanlist = [] #所有GO的非重复列表
with open('2_clean.obo','r') as f:
while True:
line = f.readline()
strline = str(line)
m = re.match('id: GO',line)
if m is not None:
gocleanlist.append(re.search('GO:\d{7}',line).group())
if not line:
break
f.close()
len(gocleanlist)
#找所有儿子
def dictsonGO():
with open('2_clean.obo','r') as f:
while True:
line = f.readline()
strline = str(line)
m = re.match('id: GO',line)
if m is not None:
son_m = re.search('GO:\d{7}',line).group()
h = re.match('is_a: GO|relationship: part_of',strline)
if h is not None:
father = re.search('GO:\d{7}',line).group()
if father not in dict_son:
dict_son[father] = [son_m]
else:
dict_son[father].append(son_m)
if not line:
break
f.close()
def finderson(a):
son = dict_son.get(a)
if son:
son_list.extend(son)
for i in son:
if i in finished: #找完所有儿子并统计过的GO跳过,相当于剪枝了
pass
else:
finderson(i)
dict_son ={}
dictsonGO()
#儿子计数
list_ =[]
for j in range(0,len(tab['goAcc'])-1):
a = tab['goAcc'][j]
list_.append(a)
result = pd.value_counts(list_) #将Trichoderma reesei QM6a里所有基因所在的GO计数,基因数量最高的GO有500+个基因
result_frame = result.to_frame()
dict_result ={} #见字典好去索引,比如A是B的父级,B在该字典里,就把键值(基因数量)也给A加上
for i in range(0,len(result)-1):
dict_result[result_frame[0].keys()[i]] = result_frame[0][i]
后面就是对所有GO进行计数,想了想,干脆一起把下一问也一起干了。
即每一个GO中所属的基因显著上调和下调的个数有多少?
deg = pd.read_csv('DEG.txt',sep='\t', encoding='utf8')
Down = []
Up = []
#这个也不多,我循环套循环跑了,觉得浑身难受的就直接建字典。
for i in range(0,len(deg['gene_id'])):
for j in range(0,len(tab['proteinId'])):
if deg['gene_id'][i] == tab['proteinId'][j]:
if deg['significance'][i] == 'DOWN':
Down.append(tab['goAcc'][j])
if deg['significance'][i] == 'UP':
Up.append(tab['goAcc'][j])
#列表建字典,主要后面搜索省时间
Down_dict = {}
Up_dict = {}
for i in Down:
Down_dict[i] = 1
for i in Up:
Up_dict[i] = 1
with open("count_all.csv","w",newline='') as csvfile:
writer = csv.writer(csvfile, delimiter=' ')
writer.writerow(["GO_id,","direct_Gene_number,","all_of_gene,","Down,","Up"]) #direct_Gene_number是直属于这个GO的基因,all_of_gene额外加上了下级GO的基因数量。
finished = [] # finderson 找完的记录在里面
finish_down={} # 顺便把down
finish_up={}
time =0
for a in gocleanlist:
#可以加个输出计量运行到哪儿了
son_list = []
Down_num = 0
Up_num=0
n = dict_result.get(a,0)
all_of_gene = n
finderson(a)
son_list = sorted(set(son_list),key=son_list.index)
for z in son_list:
all_of_gene = all_of_gene + dict_result.get(z,0)
if z in Down_dict:
if z in finish_down:
Down_num = Down_num + finish_down[z]
else:
Down_num =Down_num+1
if z in Up_dict:
if z in finish_up:
Up_num = Up_num + finish_up[z]
else:
Up_num =Up_num+1
if all_of_gene == 0: #没有就不记了,省地方,不然得一堆0中找值
pass
else:
writer.writerow([a,",",n,",",all_of_gene,",",Down_num,",",Up_num])
finished.append(a)
finish_down[a]=Down_num
finish_up[a]=Up_num
csvfile.close()
#输出便是此番结果
GO_id, direct_Gene_number, all_of_gene, Down, Up
GO:0000002 , 1 , 1 , 0 , 0
GO:0000003 , 0 , 1 , 0 , 0
GO:0000009 , 15 , 15 , 0 , 0
GO:0000015 , 2 , 2 , 0 , 0
GO:0000026 , 15 , 18 , 1 , 0
GO:0000030 , 19 , 104 , 5 , 0
GO:0000033 , 15 , 15 , 0 , 0
GO:0000036 , 3 , 3 , 0 , 0
GO:0000041 , 0 , 8 , 0 , 1
GO:0000049 , 1 , 1 , 0 , 0
GO:0000059 , 10 , 10 , 0 , 0
GO:0000062 , 2 , 2 , 0 , 0
GO:0000087 , 1 , 1 , 0 , 0
GO:0000096 , 0 , 11 , 0 , 0
GO:0000097 , 0 , 8 , 0 , 0
GO:0000103 , 2 , 3 , 0 , 0
GO:0000104 , 3 , 6 , 0 , 0
GO:0000105 , 7 , 7 , 0 , 0
GO:0000107 , 2 , 2 , 0 , 0
GO:0000121 , 1 , 1 , 0 , 0
GO:0000139 , 3 , 3 , 0 , 0
GO:0000140 , 8 , 8 , 0 , 0
GO:0000145 , 3 , 3 , 0 , 0
GO:0000148 , 2 , 2 , 0 , 0
GO:0000150 , 1 , 1 , 0 , 0
GO:0000151 , 56 , 58 , 0 , 0
......
哪些GO发生了显著的上调或者下调的富集?
做GO的富集是算超几何分布,相比python,使用R更简单。
library(dplyr)
setwd("C:/Users/syddd/bioinfomation")
cou <- read.csv("count_all.csv")
N = 13267+3043+2410 # 三个顶级GO的基因数量相加
n_down = 150 + 22 + 9
n_up = 222 + 32 + 16
df1<-data.frame(GO_id=c(), down_p_value=c(), up_p_value=c())
df2<-data.frame(GO_id=c(), down_p_value=c())
df3<-data.frame(GO_id=c(), up_p_value=c())
for (i in 1:nrow(cou)) {
k_down <- cou[i,"Down"]
k_up <- cou[i,"Up"]
m <- cou[i,"all_of_gene"]
p_down = 1-phyper(k_down-1,m, N-m, n_down)
p_up = 1-phyper(k_up-1,m, N-m, n_up)
list1 <- list(GO_id=cou[i,"GO_id"], down_p_value=p_down, up_p_value=p_up)
df1 <- rbind(df1,as.data.frame(list1))
if (p_down <= 0.05){ #挑出来下调显著的
list2 <- list(GO_id=cou[i,"GO_id"], down_p_value=p_down)
df2 <- rbind(df2,as.data.frame(list2))
}
if (p_up <= 0.05){ #挑出来上调显著的
list3 <- list(GO_id=cou[i,"GO_id"], up_p_value=p_up)
df3 <- rbind(df3,as.data.frame(list3))
}
}