202208祖先序列重构

2022-08-08 10:03:38 阅读：230 来源： 互联网

标签：node 重构 print 202208 re flag 序列 line txt

首先理解清楚，基因发生了正选择，并不表示该基因就是发生了趋同进化（适应趋同）。那么什么才是发生了适应趋同呢？一般定义为满足两个条件：1、处于正选择；2、现生物种氨基酸位点同祖先相比存在改变（而且最好是正选择位点同改变位点时同一位点）。

现生物种氨基酸位点同祖先相比存在变化可分为两种情况：

　　1、所研究的现生目标物种在某位点处，氨基酸残基都相同，这种情况就被称为convergent site。又可进一步分为parallel substitution和convergent substitution 【Storz J F. Causes of molecular convergence and parallelism in protein evolution[J]. Nature Reviews Genetics, 2016, 17(4): 239-250. Hu Y, Wu Q I, Ma S, et al. Comparative genomics reveals convergent evolution between the bamboo-eating giant and red pandas[J]. Proceedings of the National Academy of Sciences, 2017, 114(5): 1081-1086.】

　　2、所研究的现生目标物种在某位点处，氨基酸残基不完全相同，该情况称为common change。

以上两种根据自己的研究情况，只要基因又与正选择基因重叠，就可以称该基因发生了趋同适应进化【Foote A D, Liu Y, Thomas G W C, et al. Convergent evolution of the genomes of marine mammals[J]. Nature genetics, 2015, 47(3): 272-275.】。

#文件1、蛋白序列文件（cds序列也可以，只需要修改配置文件参数seqtype = 3）

#文件2、树文件

(base) vip10t01@bio10-desktop 2022-07-17 22:17:59 ~/project/01-oxy/gene_cds/reconstrctAnc
$cat paml_tree.nwk
((Lcha:0.235889,Pann:0.271384):0.058453,(Locu:0.1986,(Sfor:0.182893,((Eele:0.166864,Drer:0.159921):0.0522582,(Olat:0.148378,((Marm:0.072489,Malb:0.0946715):0.0165605,(Carg:0.103992,(Atest:0.0589527,Bsple:0.0974119):0.0139159):0.0136644):0.0327347):0.104933):0.0345937):0.0507974):0.058453);

#文件3、配置文件

      seqfile = OG0013976.fa.bla.uniq.fa.maf.id.2.cds.fas.best.fas-gb.re.phy.stop                 * sequence data filename (核苷酸文件)
     treefile = paml_tree.nwk     * tree structure file name (不用标记树枝)
      outfile = mlc.ancestral.sequence.reconstruction        * main result file name

        noisy = 9  * 0,1,2,3,9: how much rubbish on the screen
      verbose = 2  * 0: concise; 1: detailed, 2: too much
      runmode = 0  * 0: user tree;  1: semi-automatic;  2: automatic
                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

      seqtype = 3  * 1:codons; 2:AAs; 3:codons-->AAs (根据自己的序列文件是氨基酸还是核酸进行选择)
        clock = 0  * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0  * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
   aaRatefile = /home/data/vip10t01/biosoftware/paml4.9j/dat/jones.dat  * only used for aa seqs with model=empirical(_F) (从安装PAML的路径中找)
                   * dayhoff.dat, jones.dat, wag.dat, mtmam.dat, or your own
        model = 2
                   * models for codons:
                       * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
                   * models for AAs or codon-translated AAs:
                       * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F
                       * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)
        icode = 0  * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0
                   * codon: 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
                   * AA: 0:rates, 1:separate
    fix_alpha = 0 * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 1  * different alphas for genes
        ncatG = 4  * # of categories in dG of NSsites models

        getSE = 0  * 0: don't want them, 1: want S.E.s of estimates
 RateAncestor = 1  * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
   Small_Diff = .5e-6
    cleandata = 0  * remove sites with ambiguity data (1:yes, 0:no)?
       method = 1  * Optimization method 0: simultaneous; 1: one branch a time

* Genetic codes: 0:universal, 1:mammalian mt., 2:yeast mt., 3:mold mt.,
* 4: invertebrate mt., 5: ciliate nuclear, 6: echinoderm mt., 
* 7: euplotid mt., 8: alternative yeast nu. 9: ascidian mt., 
* 10: blepharisma nu.
* These codes correspond to transl_table 1 to 11 of GENEBANK.

(base) vip10t01@bio10-desktop 2022-07-17 22:28:21 ~/project/01-oxy/gene_cds/reconstrctAnc
$codeml baseml_aa1.ctl #运行

由于需要的输出文件为rst，而这个文件对于每一个orthologous名字都是一样的，为了区分，只有将每个orthologous分别存于不同文件夹下运行【已证实同一文件夹下运行，最终rst文件只会覆盖】

(base) vip10t01@bio10-desktop 2022-07-19 11:34:07 ~/project/01-oxy/gene_cds/5001
$for i in `ls *.ctl_re |awk -F"." '{print $1}'`; do echo mkdir $i; done >>z_mkdir.sh #为每个文件创建独立文件夹

(base) vip10t01@bio10-desktop 2022-07-19 11:33:46 ~/project/01-oxy/gene_cds/reconstrctAnc
$mv ../5001/z_mkdir.sh .
$cat z_mkdir.sh |wc -l  #看一下数量
9868
$bash z_mkdir.sh #创建文件夹
(base) vip10t01@bio10-desktop 2022-07-19 11:35:41 ~/project/01-oxy/gene_cds/5001
$for i in `ls *.stop |awk -F"." '{print $1}'`; do echo cp $i.fa.bla.uniq.fa.maf.id.2.cds.fas.best.fas-gb.re.phy.stop ../reconstrctAnc/$i; done >>z_Ancp.sh #复制序列文件，到各自的文件夹

$bash z_Ancp.sh

(base) vip10t01@bio10-desktop 2022-07-19 12:18:03 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo cp paml_tree.nwk $i; done >>z.sh #复制树文件到各个文件夹
$bash z.sh
$for i in `ls -F |grep "/$"`; do cp ../reconAnc.ctl $i; done #复制配置文件到各个文件夹

(base) vip10t01@bio10-desktop 2022-07-20 11:53:57 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "sed -i 's/@/${i}\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' ${i}reconAnc.ctl"; done >>z_re.sh #修改配置文件

(base) vip10t01@bio10-desktop 2022-07-20 11:54:22 ~/project/01-oxy/gene_cds/reconstrctAnc
$head z_re.sh #一定看一下，有问题酌情修改
sed -i 's/@/OG0000000/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000000/reconAnc.ctl
sed -i 's/@/OG0000003/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000003/reconAnc.ctl
sed -i 's/@/OG0000004/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000004/reconAnc.ctl
sed -i 's/@/OG0000008/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000008/reconAnc.ctl
sed -i 's/@/OG0000015/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000015/reconAnc.ctl
sed -i 's/@/OG0000017/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000017/reconAnc.ctl
sed -i 's/@/OG0000021/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000021/reconAnc.ctl
sed -i 's/@/OG0000022/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000022/reconAnc.ctl
sed -i 's/@/OG0000024/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000024/reconAnc.ctl
sed -i 's/@/OG0000026/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop/g' OG0000026/reconAnc.ctl

$sed "s/\/\\\./\\\./g" z_re.sh |head #酌情修改

$bash z_re.sh

#创建运行代码
(base) vip10t01@bio10-desktop 2022-07-20 12:22:23 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "cd $i; codeml reconAnc.ctl"; done >>z_runAnce.sh
$cd OG0000000/; codeml reconAnc.ctl #用一个文件测试了一下，没啥问题。

#开始运行
(base) vip10t01@bio10-desktop 2022-07-29 07:44:18 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup parallel -j 25 <z_runAnce.sh & 【半个小时不到就跑完了】
[1] 3980356

结果处理：每个orthologous做祖先重构时，树文件都是相同的，从而我们可以分析出需要进行比较的节点

tree with node labels for Rod Page's TreeView
((1_Lcha, 2_Pann) 14 , (3_Locu, (12_Sfor, ((11_Eele, 10_Drer) 18 , (9_Olat, ((5_Marm, 4_Malb) 21 , (7_Carg, (6_Atest, 8_Bsple) 23 ) 22 ) 20 ) 19 ) 17 ) 16 ) 15 ) 13 ;

条件1：提取air-breathing和其祖先序列在某一位点处有差异【5个中有两个存在差异就行】的氨基酸位点。
　　自己细化为：5个中有1、2、3、4、5个存在差异的类型
node #14 ==》2_Pann
node #18 ==》11_Eele
node #21 ==》4_Malb
node #22 ==》7_Carg
node #23 ==》6_Atest
在条件1的基础上进行位点分类：
　　对比所有air-breathing，如果该位点全部相同，则为convergent
　　对比所有air-breathing，如果该位点存在不同，则为parallel

# _author: Administrator
# _date: 2022/7/29
# Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile
P=[]
M=[]
A=[]
E=[]
C=[]
n14=[]
n18=[]
n21=[]
n22=[]
n23=[]
import re
import sys
with open(sys.argv[1], "r+") as infile:
# with open("rst", "r+") as infile:
    for line in infile.readlines():
        if line.startswith("Pann"):
            line_re = re.sub("Pann| |\n", "", line)
            P.append(line_re)
            # print(P)
        elif line.startswith("Malb"):
            line_re = re.sub("Malb| |\n", "", line)
            M.append(line_re)
        elif line.startswith("Atest"):
            line_re = re.sub("Atest| |\n", "", line)
            A.append(line_re)
        elif line.startswith("Carg"):
            line_re = re.sub("Carg| |\n", "", line)
            C.append(line_re)
        elif line.startswith("Eele"):
            line_re = re.sub("Eele| |\n", "", line)
            E.append(line_re)
        elif line.startswith("node #14"):
            line_re = re.sub("node #14| |\n", "", line)
            n14.append(line_re)
        elif line.startswith("node #18"):
            line_re = re.sub("node #18| |\n", "", line)
            n18.append(line_re)
        elif line.startswith("node #21"):
            line_re = re.sub("node #21| |\n", "", line)
            n21.append(line_re)
        elif line.startswith("node #22"):
            line_re = re.sub("node #22| |\n", "", line)
            n22.append(line_re)
        elif line.startswith("node #23"):
            line_re = re.sub("node #23| |\n", "", line)
            n23.append(line_re)
#node #14 ==》2_Pann
pn=[]
flag=0
for a, b in zip(P[0], n14[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        pn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(pn)
#node18 ==》11_Eele
flag=0
en=[]
for a, b in zip(E[0], n18[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        en.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(en)
#node #21 ==》4_Malb
flag=0
mn=[]
for a, b in zip(M[0], n21[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        mn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(mn)
#node #22 ==》7_Carg
flag=0
cn=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        cn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(cn)
#node #23 ==》6_Atest
flag=0
an=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        an.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(an)
# times=[]
tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348
tmp2=[x[2] for x in pn]
tmp3=[x[2] for x in en]
tmp4=[x[2] for x in cn]
tmp5=[x[2] for x in mn]
# print(tmp1)
# print(tmp2)
times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表
# print(times)

sets=set(times)
dict={}
for item in sets:
    dict.update({item:times.count(item)})
# print(dict)
times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化)
print(times5)
# print(A)
# convergentSite=open(sys.argv[2],"a+")
# convergentSite=open("11-out1.txt","a+")
# parallelSite=open(sys.argv[3],"a+")
# parallelSite=open("11-out2.txt","a+")
goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名
# goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名
for i in times5:
    i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]]
    # print(i_list)
    if "-" not in i_list: #只考虑位点非空（即：如果序列在该位点有"-"，则不考虑该位点）【如果考虑，可以发现很多位点都是"-"】
        # print(i_list[1:-1])
        goalOrthologous.write(infile.name+"\n") #输出orthologous名字【同一个文件可能输出多次，后面再处理】
        if len(set(i_list[1:-1]))==1: #convergent sites
            print(i_list)
            # convergentSite=open("11-out1.txt","a+")
            convergentSite=open(sys.argv[2],"a+")
            #注意输出格式为：siteNumber E A P C M
            convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+"\n")
        else: #common changes
            print(i_list)
            # parallelSite=open("11-out2.txt","a+")
            parallelSite=open(sys.argv[3],"a+")
            parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+"\n")

(base) vip10t01@bio10-desktop 2022-07-30 22:31:55 ~/project/01-oxy/gene_cds/reconstrctAnc/OG0009213
$python3 ../ancestorSeq.py OG0009213/rst OG0009213/convergentFile.txt OG0009213/parallelFile.txt z_goalOrthologous.txt  #验证可用性【结果完全符合预期】

#写循环
(base) vip10t01@bio10-desktop 2022-07-30 22:42:10 ~/project/01-oxy/gene_cds/reconstrctAnc
$for i in `ls -F |grep "/$"`; do echo "python3 ../ancestorSeq.py ${i}rst ${i}convergentFile.txt ${i}parallelFile.txt z_goalOrthologous.txt"; done >>z_ancestor.sh
#看看文件如下，符合预期

(base) vip10t01@bio10-desktop 2022-07-30 22:49:49 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh &  #运行程序  
[1] 1221584

#可能遇到重跑的情况
(base) vip10t01@bio10-desktop 2022-07-30 23:06:06 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |xargs rm #删除指定文件名
$find -name "convergentFile.txt" |xargs rm
$rm nohup.out
$rm z_goalOrthologous.txt

尝试结果1

#代码出了上述问题时，结果如下
(base) vip10t01@bio10-desktop 2022-07-31 07:43:35 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #common change
5249

(base) vip10t01@bio10-desktop 2022-07-31 07:51:19 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #convergent site
33

#修改代码后：

(base) vip10t01@bio10-desktop 2022-07-31 07:58:47 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |xargs rm
$find -name "parallelFile.txt" |xargs rm
$rm nohup.out
$rm z_goalOrthologous.txt

$nohup bash z_ancestor.sh &  【很快就跑完了】
[1] 1682506

(base) vip10t01@bio10-desktop 2022-07-31 07:43:35 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #common changes
5253

(base) vip10t01@bio10-desktop 2022-07-31 07:51:19 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #convergent sites
18

(base) vip10t01@bio10-desktop 2022-07-31 09:18:22 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt"  #convergent orthologous
./OG0011532/convergentFile.txt
./OG0003214/convergentFile.txt
./OG0000890/convergentFile.txt
./OG0001393/convergentFile.txt
./OG0006886/convergentFile.txt
./OG0001851/convergentFile.txt
./OG0006300/convergentFile.txt
./OG0004181/convergentFile.txt
./OG0001577/convergentFile.txt
./OG0009321/convergentFile.txt
./OG0009594/convergentFile.txt
./OG0007484/convergentFile.txt
./OG0002346/convergentFile.txt
./OG0000982/convergentFile.txt
./OG0011231/convergentFile.txt
./OG0008506/convergentFile.txt
./OG0011059/convergentFile.txt
./OG0012207/convergentFile.txt

上面python脚本修改一下，将由祖先的什么氨基酸位点变为现在的什么氨基酸位点都输出。再跑一遍

# _author: Administrator
# _date: 2022/7/29
# Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile
P=[]
M=[]
A=[]
E=[]
C=[]
n14=[]
n18=[]
n21=[]
n22=[]
n23=[]
import re
import sys
with open(sys.argv[1], "r+") as infile:
# with open("rst", "r+") as infile:
    for line in infile.readlines():
        if line.startswith("Pann"):
            line_re = re.sub("Pann| |\n", "", line)
            P.append(line_re)
            # print(P)
        elif line.startswith("Malb"):
            line_re = re.sub("Malb| |\n", "", line)
            M.append(line_re)
        elif line.startswith("Atest"):
            line_re = re.sub("Atest| |\n", "", line)
            A.append(line_re)
        elif line.startswith("Carg"):
            line_re = re.sub("Carg| |\n", "", line)
            C.append(line_re)
        elif line.startswith("Eele"):
            line_re = re.sub("Eele| |\n", "", line)
            E.append(line_re)
        elif line.startswith("node #14"):
            line_re = re.sub("node #14| |\n", "", line)
            n14.append(line_re)
        elif line.startswith("node #18"):
            line_re = re.sub("node #18| |\n", "", line)
            n18.append(line_re)
        elif line.startswith("node #21"):
            line_re = re.sub("node #21| |\n", "", line)
            n21.append(line_re)
        elif line.startswith("node #22"):
            line_re = re.sub("node #22| |\n", "", line)
            n22.append(line_re)
        elif line.startswith("node #23"):
            line_re = re.sub("node #23| |\n", "", line)
            n23.append(line_re)
#node #14 ==》2_Pann
pn=[]
flag=0
for a, b in zip(P[0], n14[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        pn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(pn)
#node18 ==》11_Eele
flag=0
en=[]
for a, b in zip(E[0], n18[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        en.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(en)
#node #21 ==》4_Malb
flag=0
mn=[]
for a, b in zip(M[0], n21[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        mn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(mn)
#node #22 ==》7_Carg
flag=0
cn=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        cn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(cn)
#node #23 ==》6_Atest
flag=0
an=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        an.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(an)
# times=[]
tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348
tmp2=[x[2] for x in pn]
tmp3=[x[2] for x in en]
tmp4=[x[2] for x in cn]
tmp5=[x[2] for x in mn]
# print(tmp1)
# print(tmp2)
times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表
# print(times)

sets=set(times)
dict={}
for item in sets:
    dict.update({item:times.count(item)})
# print(dict)
times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化)
print(times5)
# print(A)
# convergentSite=open(sys.argv[2],"a+")
# convergentSite=open("11-out1.txt","a+")
# parallelSite=open(sys.argv[3],"a+")
# parallelSite=open("11-out2.txt","a+")
goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名
# goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名
for i in times5:
    i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]]
    node_list=[str(n14[0][i]),str(n18[0][i]),str(n21[0][i]),str(n22[0][i]),str(n23[0][i])] #修改处
    # print(i_list)
    if "-" not in i_list: #只考虑位点非空（即：如果序列在该位点有"-"，则不考虑该位点）
        # print(i_list[1:-1])
        goalOrthologous.write(infile.name+"\n")
        if len(set(i_list[1:]))==1: #convergent
            print(i_list)
            # convergentSite=open("11-out1.txt","a+")
            convergentSite=open(sys.argv[2],"a+")
            #注意输出格式为：siteNumber E A P C M node14 node18 node21 node22 node23
            convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+"\n") #修改处
        else: #parallel
            print(i_list)
            # parallelSite=open("11-out2.txt","a+")
            parallelSite=open(sys.argv[3],"a+")
            parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+"\n") #修改处

(base) vip10t01@bio10-desktop 2022-07-31 10:31:18 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |xargs rm
$find -name "parallelFile.txt" |xargs rm
$rm z_goalOrthologous.txt

(base) vip10t01@bio10-desktop 2022-07-31 11:09:13 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh & 【正在跑】
[1] 1857903

#python代码再修改一下，将改变处氨基酸位点的每条序列的对应位置的氨基酸都输出

# _author: Administrator
# _date: 2022/7/29
# Usaeg: .py infile convergentFile commonFile goalOrthologousNameFile
P=[]
M=[]
A=[]
E=[]
C=[]
n14=[]
n18=[]
n21=[]
n22=[]
n23=[]

LC=[]
LO=[]
BS=[]
MA=[]
OL=[]
DR=[]
SF=[]
N13=[]
N15=[]
N16=[]
N17=[]
N19=[]
N20=[]
import re
import sys
with open(sys.argv[1], "r+") as infile:
# with open("rst", "r+") as infile:
    for line in infile.readlines():
        if line.startswith("Pann"):
            line_re = re.sub("Pann| |\n", "", line)
            P.append(line_re)
            # print(P)
        elif line.startswith("Malb"):
            line_re = re.sub("Malb| |\n", "", line)
            M.append(line_re)
        elif line.startswith("Atest"):
            line_re = re.sub("Atest| |\n", "", line)
            A.append(line_re)
        elif line.startswith("Carg"):
            line_re = re.sub("Carg| |\n", "", line)
            C.append(line_re)
        elif line.startswith("Eele"):
            line_re = re.sub("Eele| |\n", "", line)
            E.append(line_re)
        elif line.startswith("node #14"):
            line_re = re.sub("node #14| |\n", "", line)
            n14.append(line_re)
        elif line.startswith("node #18"):
            line_re = re.sub("node #18| |\n", "", line)
            n18.append(line_re)
        elif line.startswith("node #21"):
            line_re = re.sub("node #21| |\n", "", line)
            n21.append(line_re)
        elif line.startswith("node #22"):
            line_re = re.sub("node #22| |\n", "", line)
            n22.append(line_re)
        elif line.startswith("node #23"):
            line_re = re.sub("node #23| |\n", "", line)
            n23.append(line_re)

        elif line.startswith("Lcha"):
            line_re = re.sub("Lcha| |\n", "", line)
            LC.append(line_re)
        elif line.startswith("Locu"):
            line_re = re.sub("Locu| |\n", "", line)
            LO.append(line_re)
        elif line.startswith("Bsple"):
            line_re = re.sub("Bsple| |\n", "", line)
            BS.append(line_re)
        elif line.startswith("Marm"):
            line_re = re.sub("Marm| |\n", "", line)
            MA.append(line_re)
        elif line.startswith("Olat"):
            line_re = re.sub("Olat| |\n", "", line)
            OL.append(line_re)
        elif line.startswith("Drer"):
            line_re = re.sub("Drer| |\n", "", line)
            DR.append(line_re)
        elif line.startswith("Sfor"):
            line_re = re.sub("Sfor| |\n", "", line)
            SF.append(line_re)
        elif line.startswith("node #13"):
            line_re = re.sub("node #13| |\n", "", line)
            N13.append(line_re)
        elif line.startswith("node #15"):
            line_re = re.sub("node #15| |\n", "", line)
            N15.append(line_re)
        elif line.startswith("node #16"):
            line_re = re.sub("node #16| |\n", "", line)
            N16.append(line_re)
        elif line.startswith("node #17"):
            line_re = re.sub("node #17| |\n", "", line)
            N17.append(line_re)
        elif line.startswith("node #19"):
            line_re = re.sub("node #19| |\n", "", line)
            N19.append(line_re)
        elif line.startswith("node #20"):
            line_re = re.sub("node #20| |\n", "", line)
            N20.append(line_re)
#node #14 ==》2_Pann
pn=[]
flag=0
for a, b in zip(P[0], n14[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        pn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(pn)
#node18 ==》11_Eele
flag=0
en=[]
for a, b in zip(E[0], n18[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        en.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(en)
#node #21 ==》4_Malb
flag=0
mn=[]
for a, b in zip(M[0], n21[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        mn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(mn)
#node #22 ==》7_Carg
flag=0
cn=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        cn.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(cn)
#node #23 ==》6_Atest
flag=0
an=[]
for a, b in zip(C[0], n22[0]): #字符串对应比较差异
    if a != b:
        locate = a, b, flag
        an.append(locate)
        # print('a:', a, ' b:', b, ' site:',flag)
        flag+=1
    else:
        flag+=1
# print(an)
# times=[]
tmp1=[x[2] for x in an] #https://blog.csdn.net/weixin_38082364/article/details/79833348
tmp2=[x[2] for x in pn]
tmp3=[x[2] for x in en]
tmp4=[x[2] for x in cn]
tmp5=[x[2] for x in mn]
# print(tmp1)
# print(tmp2)
times=tmp1+tmp2+tmp3+tmp4+tmp5 #几个列表合成一个列表
# print(times)

sets=set(times)
dict={}
for item in sets:
    dict.update({item:times.count(item)})
# print(dict)
times5=[k for k,v in dict.items() if v==5] #输出次数为5的键(表示该位点在5个air-breathing中均有变化)
print(times5)
# print(A)
# convergentSite=open(sys.argv[2],"a+")
# convergentSite=open("11-out1.txt","a+")
# parallelSite=open(sys.argv[3],"a+")
# parallelSite=open("11-out2.txt","a+")
goalOrthologous=open(sys.argv[4],"a+") #用于存储目标orthologous的文件名
# goalOrthologous=open("11-tmp.txt","a+") #用于存储目标orthologous的文件名
for i in times5:
    i_list=[i,str(E[0])[i],str(A[0])[i],str(P[0])[i],str(C[0])[i],str(M[0])[i]]
    node_list=[str(n14[0][i]),str(n18[0][i]),str(n21[0][i]),str(n22[0][i]),str(n23[0][i])]
    I_list=[str(LC[0])[i],str(LO[0])[i],str(BS[0])[i],str(MA[0])[i],str(OL[0])[i],str(DR[0])[i],str(SF[0])[i]]
    Node_list=[str(N13[0][i]),str(N15[0][i]),str(N16[0][i]),str(N17[0][i]),str(N19[0][i]),str(N20[0][i])]
    # print(i_list)
    if "-" not in i_list: #只考虑位点非空（即：如果序列在该位点有"-"，则不考虑该位点）
        # print(i_list[1:-1])
        goalOrthologous.write(infile.name+"\n")
        if len(set(i_list[1:]))==1: #convergent
            print(i_list)
            # convergentSite=open("11-out1.txt","a+")
            convergentSite=open(sys.argv[2],"a+")
            #注意输出格式为：siteNumber Eele Atest Pann Carg Malb node14 node18 node21 node22 node23 @ Lcha Locu Bsple Marm Olat Drer Sfor @ node13 node15 node16 node17 node19 node20
            convergentSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+" @ "+" ".join(I_list[:])+" @ "+" ".join(Node_list[:])+"\n")
        else: #parallel
            print(i_list)
            # parallelSite=open("11-out2.txt","a+")
            parallelSite=open(sys.argv[3],"a+")
            parallelSite.write(str(i_list[0])+"\t"+" ".join(i_list[1:])+" ".join(node_list[:])+" @ "+" ".join(I_list[:])+" @ "+" ".join(Node_list[:])+"\n")

(base) vip10t01@bio10-desktop 2022-07-31 12:22:38 ~/project/01-oxy/gene_cds/reconstrctAnc
$rm nohup.out
$find -name "convergentFile.txt" |xargs rm
$find -name "parallelFile.txt" |xargs rm
$rm z_goalOrthologous.txt

(base) vip10t01@bio10-desktop 2022-07-31 12:56:51 ~/project/01-oxy/gene_cds/reconstrctAnc
$nohup bash z_ancestor.sh & 
[1] 1889760

(base) vip10t01@bio10-desktop 2022-07-31 19:31:50 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" |wc -l #跟前面完全相同，结果可靠
5253

(base) vip10t01@bio10-desktop 2022-07-31 19:35:50 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l #跟前面完全相同，结果可靠
18

结果处理：直接处理convergent。手动统计，但是统计过程中发现需要统计位点由什么变为什么，这个是以后做的时候需要改进的，这里量少，手动统计就行了。

首先判断了基因是否是one-to-one gene；

然后判断基因是否是正选择基因，从而该基因是否为适应进化基因；

接下来判断是parallel site还是convergent site

#主要的结果统计
(base) vip10t01@bio10-desktop 2022-08-01 09:30:39 ~/project/01-oxy/gene_cds/reconstrctAnc
$cat z_goalOrthologous.txt |uniq |wc -l  #判断总的有多少个基因
5259

$find -name "parallelFile.txt" |wc -l  #查考有多少个基因有common sites
5253
$find -name "parallelFile.txt" |xargs cat >>z_commonSite.txt #用来判断有多少个common sites
$cat z_commonSite.txt |wc -l  #common sites数量
12878

(base) vip10t01@bio10-desktop 2022-08-01 09:39:05 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "convergentFile.txt" |wc -l  #查看多少个基因有convergent site
18
$find -name "convergentFile.txt" |xargs cat >>z_convergentSite.txt #用来判断有多少个convergent sites
$cat z_convergentSite.txt |wc -l #convergent sites数量
18

(base) vip10t01@bio10-desktop 2022-08-01 16:06:23 ~/project/01-oxy/gene_cds/reconstrctAnc
$find -name "parallelFile.txt" >z_commonSiteFile.txt
$sed "s/\.//g" z_commonSiteFile.txt |sed "s/\/parallelFiletxt/\.fa\.bla\.uniq\.fa\.maf\.id\.2\.cds\.fas\.best\.fas-gb\.re\.phy\.stop\.out/g" |sed "s/\///g" |head

标签：node,重构,print,202208,re,flag,序列,line,txt
来源： https://www.cnblogs.com/ly-zy/p/16560757.html

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9

202208祖先序列重构

结果处理：每个orthologous做祖先重构时，树文件都是相同的，从而我们可以分析出需要进行比较的节点