山东大学生物信息学实验 - 转录组学数据分析和富集分析

问题如下：

|使用Python或者Perl编程，完成以下任务：

1.tab是Trichoderma reesei QM6a的annotation文件，2.obo是GO的ontology文件，根据ontology文件中所记录的GO关系：

(a) 获得已注释的所有基因属于哪一个GO的全部列表

(b) 计算Trichoderma reesei QM6a中每一个GO分别有多少个基因。

DEG.txt中记录了某一次转录组分析的差异基因结果。请计算：

(a) 每一个GO中所属的基因显著上调和下调的个数有多少？

(b) 哪些GO发生了显著的上调或者下调的富集？

文件解释：

先说一下文件，1.tab是如下图所示的annotation文件，大概两万多行，有proteinId，gotermId，goName，gotermType，goAcc的信息，其中本次用到的的是proteinId和goAcc。goAcc是该基因所属的GO号，形式是GO:加七位数字。一个基因能对应多个GO id，一个GO id也能对应多个基因。

proteinId	gotermId	goName	gotermType	goAcc
1673	3504	translational initiation	biological_process	GO:0006413
1673	2855	cytoplasm	cellular_component	GO:0005737
1673	1108	translation initiation factor activity	molecular_function	GO:0003743
1673	1096	RNA binding	molecular_function	GO:0003723
1692	3503	protein biosynthesis	biological_process	GO:0006412
1692	2953	ribosome	cellular_component	GO:0005840
1692	2746	intracellular	cellular_component	GO:0005622
1692	1107	structural constituent of ribosome	molecular_function	GO:0003735
1692	1096	RNA binding	molecular_function	GO:0003723
1702	3444	transcription	biological_process	GO:0006350
1702	1233	DNA-directed DNA polymerase activity	molecular_function	GO:0003887
1702	1056	DNA binding	molecular_function	GO:0003677
1702	10782	sigma DNA polymerase activity	molecular_function	GO:0019984
1702	8197	theta DNA polymerase activity	molecular_function	GO:0016452
1702	8196	nu DNA polymerase activity	molecular_function	GO:0016451
1702	8195	kappa DNA polymerase activity	molecular_function	GO:0016450
1702	8194	lambda DNA polymerase activity	molecular_function	GO:0016449
1702	8193	mu DNA polymerase activity	molecular_function

2.obo记录了GO之间的上下级关系，文件本身三十几万行，47125个GO，如下所示的[Term]表示一个GO，随后id：GO：XXXXXXX就是这个GO的 id，namespace就是这个GO属于本体论里三个顶级GO的哪一个，比如GO:0000001属于biological_process。

第一题的（a）就是要找到所有GO的所有上级GO（或者说父亲），最终归宿到biological_process、molecular_function、cellular_component。

format-version: 1.2
data-version: releases/2018-04-12
subsetdef: goantislim_grouping "Grouping classes that can be excluded"
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_goa "GOA and proteome slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_virus "Viral GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: gosubset_prok "Prokaryotic GO subset"
subsetdef: mf_needs_review "Catalytic activity terms in need of attention"
subsetdef: termgenie_unvetted "Terms created by TermGenie that do not follow a template and require additional vetting by editors"
subsetdef: virus_checked "Viral overhaul terms"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
remark: cvs version: use data-version
remark: Includes Ontology(OntologyID(Anonymous-35)) [Axioms: 230 Logical Axioms: 228]
remark: Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: gosubset_prok
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
namespace: molecular_function
def: "OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]
comment: This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.
synonym: "ribosomal chaperone activity" EXACT []
is_obsolete: true
consider: GO:0042254
consider: GO:0044183
consider: GO:0051082
......
......

这最后的is_a: GO：XXXXXXX! 意思是本GO是后面这个GO的一部分，是它的下级，relationship: part_of GO：XXXXXXX!是同样的意思。

有些GO和其他GO什么关系都没有，最后有个is_obsolete: true，说明它过时了，同时还会在def:最前面加上 OBSOLETE 。

还有一些其他的关系，在这里暂时无须明白里面的所有内容,可以去看看一文极速读懂 Gene Ontology，老师说找relationship: part_of、is_a:建关系即可。

#DEG.txt中记录了某一次转录组分析的差异基因结果
gene_id	A	B	p-value	FDR	significance
105313	476.125	2.01685	0	0	DOWN
81082	188.012	0.388781	4.44E-16	0.000106687	DOWN
70204	30.8683	0.0378778	3.33E-15	0.000106687	DOWN
112031	74.6035	0.708632	4.66E-15	0.000106687	DOWN
122511	1065.56	8.16165	1.44E-14	0.000106687	DOWN
49274	106.879	1.05212	1.44E-14	0.000106687	DOWN
68803	186.04	2.46745	1.69E-14	0.000106687	DOWN
123914	149.261	1.65829	3.82E-14	0.000106687	DOWN
59362	56.5321	0.138204	4.22E-14	0.000106687	DOWN
105882	195.885	0.663057	5.64E-14	0.000106687	DOWN
108676	703.881	2.29893	1.20E-13	0.000106687	DOWN
60489	1.33266	157.422	4.06E-13	0.000106687	UP
66935	2.52345	123.558	2.10E-12	0.000106687	UP
76659	214.616	1.82799	3.86E-12	0.000106687	DOWN

解决问题

获得已注释的所有基因属于哪一个GO的全部列表

#引入包
import os
import numpy as np
import pandas as pd
import csv
import re
import time

#一开始想减少耗时，把2.obo中用不到的，即除is_a:|relationship: part_of|id: 的行全删了。
with open('2.obo','r') as f:
    while True:
        line = f.readline()
        strline = str(line)
        h = re.match('is_a: GO|relationship: part_of|id: GO',strline)
        if h is not None :
            clean = open('2_clean.obo','a')
            clean.write(line)
            clean.close()
        if not line:
            break
f.close()

#字典法
def dictGO():
    father_list =[]
    with open('2_clean.obo','r') as f:
        while True:
            line = f.readline()
            strline = str(line)
            
            m = re.match('id: GO',line)
            if m is not None:
                if len(father_list) != 0:
                    dict_[term] = father_list
                    father_list = []
                term = re.search('GO:\d{7}',line).group()
                
            else:
                h = re.match('is_a: GO|relationship: part_of',strline)
                if h is not None:
                    father = re.search('GO:\d{7}',line).group()
                    father_list.append(father)
                    
            if not line:
                break
                
    f.close()
    
#迭代找爹
def finder(a):
    global num
    father = dict_.get(a)
    if father:
        father_list.extend(father)
        num = num + 1
        
        for i in father:
            finder(i)
    else:
        num_list.append(num)
        num = 0

#构建字典
dict_ ={}
dictGO()

tab = pd.read_csv('1.csv') #读入1.obo数据，可以将之换成csv读入，反正能读成数据框就行

with open("dict.csv","w",newline='') as csvfile: 
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow(["GO_id,","father_num,","father_generation,","father_id"]) #"GO_id,"结尾的逗号作为csv的分隔符
    for j in range(0,len(tab['goAcc'])-1):
        a = tab['goAcc'][j]
        father_list = []
        num_list = []
        num = 0
        finder(a)
        father_list = sorted(set(father_list),key=father_list.index)#列表去重复
        nummax = max(num_list) #这里求了父节点有几代，因为计算方法是深度优先，取了最大值
        writer.writerow([tab['goAcc'][j],",",len(father_list),",",nummax,",",father_list])

csvfile.close()

计算Trichoderma reesei QM6a中每一个GO分别有多少个基因

方便起见，稍微改一改，另外建一个记录下级节点的字典。

gocleanlist = []  #所有GO的非重复列表
with open('2_clean.obo','r') as f:
    while True:
            line = f.readline()
            strline = str(line)
            
            m = re.match('id: GO',line)
            if m is not None:
                gocleanlist.append(re.search('GO:\d{7}',line).group())
                
            if not line:
                break
f.close()
len(gocleanlist)

#找所有儿子
def dictsonGO():
    
    with open('2_clean.obo','r') as f:
        while True:
            line = f.readline()
            strline = str(line)
            
            m = re.match('id: GO',line)
            if m is not None: 
                son_m = re.search('GO:\d{7}',line).group()
            
            h = re.match('is_a: GO|relationship: part_of',strline)
            if h is not None:
                father = re.search('GO:\d{7}',line).group()
                if father not in dict_son:
                    dict_son[father] = [son_m]
                else:
                    dict_son[father].append(son_m)
                
            if not line:
                break
                
    f.close()

def finderson(a):
    son = dict_son.get(a)
    if son:
        son_list.extend(son)
        
        for i in son:
            if i in finished: #找完所有儿子并统计过的GO跳过，相当于剪枝了
                pass
            else:
                finderson(i)

dict_son ={}
dictsonGO()

#儿子计数
list_ =[]
for j in range(0,len(tab['goAcc'])-1): 
    a = tab['goAcc'][j]
    list_.append(a)
result = pd.value_counts(list_) #将Trichoderma reesei QM6a里所有基因所在的GO计数，基因数量最高的GO有500+个基因
result_frame = result.to_frame()
dict_result ={} #见字典好去索引，比如A是B的父级，B在该字典里，就把键值（基因数量）也给A加上
for i in range(0,len(result)-1):
    dict_result[result_frame[0].keys()[i]] = result_frame[0][i]

后面就是对所有GO进行计数，想了想，干脆一起把下一问也一起干了。

即每一个GO中所属的基因显著上调和下调的个数有多少？

deg = pd.read_csv('DEG.txt',sep='\t', encoding='utf8')
Down = []
Up = []
#这个也不多，我循环套循环跑了，觉得浑身难受的就直接建字典。
for i in range(0,len(deg['gene_id'])):
    for j in range(0,len(tab['proteinId'])):
        if deg['gene_id'][i] == tab['proteinId'][j]:
            if deg['significance'][i] == 'DOWN':
                Down.append(tab['goAcc'][j])
            if deg['significance'][i] == 'UP':
                Up.append(tab['goAcc'][j])
#列表建字典，主要后面搜索省时间
Down_dict = {}
Up_dict = {}
for i in Down:
    Down_dict[i] = 1
for i in Up:
    Up_dict[i] = 1

with open("count_all.csv","w",newline='') as csvfile: 
    writer = csv.writer(csvfile, delimiter=' ')
    writer.writerow(["GO_id,","direct_Gene_number,","all_of_gene,","Down,","Up"]) #direct_Gene_number是直属于这个GO的基因，all_of_gene额外加上了下级GO的基因数量。
    finished = [] # finderson 找完的记录在里面
    finish_down={} # 顺便把down
    finish_up={}
    time =0
    for a in gocleanlist:
        #可以加个输出计量运行到哪儿了
        son_list = []
        Down_num = 0
        Up_num=0
        n = dict_result.get(a,0)
        all_of_gene = n
        finderson(a) 
        son_list = sorted(set(son_list),key=son_list.index)
        for z in son_list:
            all_of_gene = all_of_gene + dict_result.get(z,0)
            if z in Down_dict:
                if z in finish_down:
                    Down_num = Down_num + finish_down[z]
                else:
                    Down_num =Down_num+1
            if z in Up_dict:
                if z in finish_up:
                    Up_num = Up_num + finish_up[z]
                else:
                    Up_num =Up_num+1
        if all_of_gene == 0: #没有就不记了，省地方，不然得一堆0中找值
            pass
        else:
            writer.writerow([a,",",n,",",all_of_gene,",",Down_num,",",Up_num])
        finished.append(a)
        finish_down[a]=Down_num
        finish_up[a]=Up_num
        
csvfile.close()

#输出便是此番结果
GO_id, direct_Gene_number, all_of_gene, Down, Up
GO:0000002 , 1 , 1 , 0 , 0
GO:0000003 , 0 , 1 , 0 , 0
GO:0000009 , 15 , 15 , 0 , 0
GO:0000015 , 2 , 2 , 0 , 0
GO:0000026 , 15 , 18 , 1 , 0
GO:0000030 , 19 , 104 , 5 , 0
GO:0000033 , 15 , 15 , 0 , 0
GO:0000036 , 3 , 3 , 0 , 0
GO:0000041 , 0 , 8 , 0 , 1
GO:0000049 , 1 , 1 , 0 , 0
GO:0000059 , 10 , 10 , 0 , 0
GO:0000062 , 2 , 2 , 0 , 0
GO:0000087 , 1 , 1 , 0 , 0
GO:0000096 , 0 , 11 , 0 , 0
GO:0000097 , 0 , 8 , 0 , 0
GO:0000103 , 2 , 3 , 0 , 0
GO:0000104 , 3 , 6 , 0 , 0
GO:0000105 , 7 , 7 , 0 , 0
GO:0000107 , 2 , 2 , 0 , 0
GO:0000121 , 1 , 1 , 0 , 0
GO:0000139 , 3 , 3 , 0 , 0
GO:0000140 , 8 , 8 , 0 , 0
GO:0000145 , 3 , 3 , 0 , 0
GO:0000148 , 2 , 2 , 0 , 0
GO:0000150 , 1 , 1 , 0 , 0
GO:0000151 , 56 , 58 , 0 , 0
......

哪些GO发生了显著的上调或者下调的富集？

做GO的富集是算超几何分布，相比python，使用R更简单。

library(dplyr)
setwd("C:/Users/syddd/bioinfomation")
cou <- read.csv("count_all.csv")
N = 13267+3043+2410  # 三个顶级GO的基因数量相加
n_down = 150 + 22 + 9
n_up = 222 + 32 + 16
df1<-data.frame(GO_id=c(), down_p_value=c(), up_p_value=c())
df2<-data.frame(GO_id=c(), down_p_value=c())
df3<-data.frame(GO_id=c(), up_p_value=c())

for (i in 1:nrow(cou)) {
  k_down <- cou[i,"Down"] 
  k_up <- cou[i,"Up"]
  m <- cou[i,"all_of_gene"]
  
  p_down = 1-phyper(k_down-1,m, N-m, n_down)
  p_up = 1-phyper(k_up-1,m, N-m, n_up)
  list1 <- list(GO_id=cou[i,"GO_id"], down_p_value=p_down, up_p_value=p_up)
  df1 <- rbind(df1,as.data.frame(list1))
  if (p_down <= 0.05){ #挑出来下调显著的
    list2 <- list(GO_id=cou[i,"GO_id"], down_p_value=p_down)
    df2 <- rbind(df2,as.data.frame(list2)) 
  } 
  if (p_up <= 0.05){ #挑出来上调显著的
    list3 <- list(GO_id=cou[i,"GO_id"], up_p_value=p_up)
    df3 <- rbind(df3,as.data.frame(list3))
  }
}