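Summary of Genome Sequencing, Assembly, and Analysis
1. Preparation before sequencing
Collect background information on the species, such as genome size and heterozygosity.
1.1 Obtaining the genome size
The genome size is the yardstick for judging whether the size of the eventual assembly is correct. If the genome is too large (>10 Gb), the memory required by current de novo assembly software exceeds what available machines provide, so assembly is not feasible in practice.
For most species the genome size can be looked up in the database at http://www.genomesize.com/ . If the species is not listed there, consider measuring the genome size experimentally (by flow cytometry).
1.1.1 An example of estimating genome size by flow cytometry: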
Yoshida, S., J. K. Ishida, et al. (2010). "A full-length enriched cDNA library and expressed sequence tag analysis of the parasitic weed, Striga hermonthica." BMC Plant Biol 10: 55.
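1.1.2 A description of genome size estimation based on Feulgen staining:
This book is a classic and highly recommended: Gregory, T. (2005). The evolution of the genome, Academic Press.
1.1.3 Examples of estimating genome size by quantitative PCR: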
Wilhelm, J., A. Pingoud, et al. (2003). "Real-time PCR-based method for the estimation of genome sizes." Nucleic Acids Res 31(10): e56.
Jeyaprakash, A. and M. A. Hoy (2009). "The nuclear genome of the phytoseiid Metaseiulus occidentalis (Acari: Phytoseiidae) is among the smallest known in arthropods." Exp Appl Acarol 47(4): 263-273.
1.1.4 An example of estimating genome size from k-mers:
Kim, E. B., X. Fang, et al. (2011). "Genome sequencing reveals insights into physiology and longevity of the naked mole rat." Nature 479(7372): 223-227.
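As a rough illustration of the k-mer approach, a common formulation divides the total number of k-mers in the reads by the depth at the main peak of the k-mer frequency histogram. The sketch below assumes the histogram is already available as a mapping from depth to the number of distinct k-mers seen at that depth (for example, exported from a k-mer counter). The function names, the simple valley-based error filter, and the toy numbers are illustrative assumptions and are not taken from the cited paper.

```python
def find_error_valley(hist):
    """Return the depth where the low-depth error tail ends.

    hist maps k-mer depth -> number of distinct k-mers with that depth.
    The valley is taken to be the last depth before the counts first rise
    (a deliberately simple heuristic); 0 means no rise was found.
    """
    depths = sorted(hist)
    for prev, cur in zip(depths, depths[1:]):
        if hist[cur] > hist[prev]:
            return prev
    return 0


def estimate_genome_size(hist):
    """Estimate genome size as (total k-mers) / (depth of the main peak)."""
    valley = find_error_valley(hist)
    # Drop likely sequencing-error k-mers at or below the valley.
    solid = {d: n for d, n in hist.items() if d > valley}
    if not solid:
        raise ValueError("histogram has no k-mers above the error valley")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy histogram: an error tail at depths 1-3 and a main peak at depth 10.
    toy = {1: 500, 2: 100, 3: 20, 8: 100, 9: 300, 10: 600, 11: 300, 12: 100}
    print(estimate_genome_size(toy))  # 1400.0 for this toy histogram
```

Real histograms are noisier than this toy case, and repeats or heterozygosity add further peaks, so the peak and valley detection would likely need to be more robust in practice.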
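1.2 Estimating heterozygosity
The main effect of heterozygosity on assembly is that the two homologous chromosomes cannot be merged: in highly heterozygous regions both copies are assembled separately, so the assembled genome comes out larger than the actual genome size.
One traditional approach is to check SSR polymorphism in the offspring of the sequenced parent. If heterozygosity is above 0.5%, assembly is considered fairly difficult; above 1%, a good assembly is very hard to obtain.
Heterozygosity is now usually estimated by k-mer analysis; here is an example: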
http://www.nature.com/nature/journal/vaop/ncurrent/full/nature11413.html
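Heterozygosity can be reduced by many generations of inbreeding.
High heterozygosity does not mean that no assembly can be produced; rather, the assembled sequence is unsuitable for downstream biological analyses such as copy number or complete gene structure.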
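To make the size inflation concrete, the toy sketch below counts distinct k-mers in two haplotypes that differ by a single SNP. Under the assumptions that the SNP lies at least k-1 bases from either end and that no k-mer repeats, the second haplotype contributes k extra distinct k-mers, which an assembler sees as a separate path. The sequences, the choice of k = 5, and the function names are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def extra_kmers_from_second_haplotype(hap_a, hap_b, k):
    """Count k-mers present in hap_b but absent from hap_a."""
    return len(kmers(hap_b, k) - kmers(hap_a, k))


# Hypothetical 21 bp haplotype with no repeated 5-mers.
HAP_A = "ACGTTGCAAGCTTAGGCTAAC"
# Same sequence with one SNP (C -> T) at 0-based position 10.
HAP_B = HAP_A[:10] + "T" + HAP_A[11:]

if __name__ == "__main__":
    k = 5
    print(len(kmers(HAP_A, k)))                             # 17 distinct k-mers
    print(extra_kmers_from_second_haplotype(HAP_A, HAP_B, k))  # 5, i.e. k
    print(len(kmers(HAP_A, k) | kmers(HAP_B, k)))           # 22 in the union
```

With many such sites across a genome, these extra k-mers show up as a secondary peak at about half the main depth in the k-mer histogram, which is what k-mer based heterozygosity estimates look for.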
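1.3 Whether a genetic map is available
As quality requirements for sequencing rise and the related techniques mature, a genetic map is fast becoming a required component of a de novo genome project. For the concepts behind genetic map construction, see the book The handbook of plant genome mapping: genetic and physical mapping.
1.4 Surveying the biological questions
This step is also very important.
2. Preparing the sequencing sample
Once the first step checks out, the species is a reasonable candidate for sequencing. Obtaining sample material can itself be a major problem: for some species, sampling alone is a challenge.
The DNA used for genome sequencing should preferably come from a single individual, which reduces the effect of between-individual heterozygosity on the assembly. Large-insert libraries are exempt from this requirement.
3. Choosing a sequencing strategy
Sequencing normally uses libraries with a gradient of insert sizes: small inserts (200, 500, 800 bp) and large inserts (1 kb, 2 kb, 5 kb, 10 kb, 20 kb, 40 kb). For species with high heterozygosity and abundant repeats, a fosmid-by-fosmid or fosmid pooling strategy may be needed.
Needless to say, the latter is considerably more expensive.
4. Genome assembly
4.1 Reviews on assembly: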
Li, Z., Y. Chen, et al. (2012). "Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph." Brief Funct Genomics 11(1): 25-37.
Treangen, T. J. and S. L. Salzberg (2012). "Repetitive DNA and next-generation sequencing: computational challenges and solutions." Nat Rev Genet 13(1): 36-46.
http://www.cbcb.umd.edu/research/assembly_primer.shtml
Schatz, M. C., J. Witkowski, et al. (2012). "Current challenges in de novo plant genome sequencing and assembly." Genome Biol 13(4): 243.
Baker, M. (2012). "De novo genome assembly: what every biologist should know." Nat Methods 9(4): 333-337. (highly recommended)
Compeau, P. E., et al. (2011). "How to apply de Bruijn graphs to genome assembly." Nat Biotechnol 29(11): 987-991.
Birney, E. (2011). "Assemblies: the good, the bad, the ugly." Nat Methods 8(1): 59-60.
Schatz, M. C., et al. (2010). "Assembly of large genomes using second-generation sequencing." Genome Res 20(9): 1165-1173.
4.2 Error-correction software:
Kelley, D. R., M. C. Schatz, et al. (2010). "Quake: quality-aware detection and correction of sequencing errors." Genome Biol 11(11): R116.
4.3 Comparisons of assembly software
Salzberg, S. L., A. M. Phillippy, et al. (2012). "GAGE: A critical evaluation of genome assemblies and assembly algorithms." Genome Res 22(3): 557-567.
Zhang, W., et al. (2011). "A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies." PLoS One 6(3): e17915.
Narzisi, G. and B. Mishra (2011). "Comparing de novo genome assembly: the long and short of it." PLoS One 6(4): e19175.
Lin, Y., et al. (2011). "Comparative Studies of de novo Assembly Tools for Next-generation Sequencing Technologies." Bioinformatics.
Hayden, E. C. (2011). "Genome builders face the competition." Nature 471(7339): 425.
Finotello, F., et al. (2011). "Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data." Brief Bioinform.
Earl, D. A., et al. (2011). "Assemblathon 1: A competitive assessment of de novo short read assembly methods." Genome Res.
4.4 Assessing assembly quality
Schatz, M. C., et al. (2011). "Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies." Brief Bioinform.
Riba-Grognuz, O., et al. (2011). "Visualization and quality assessment of de novo genome assemblies." Bioinformatics.
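Personal view:
At present the mainstream software for de novo assembly of large genomes is still ALLPATHS-LG and SOAPdenovo.
ALLPATHS-LG's strengths: the best contiguity and the best accuracy. Its drawbacks are high memory consumption and that it is not easy to use.
SOAPdenovo's strengths: it is fast, its memory consumption is acceptable, and its contiguity is reasonable. However, it makes relatively more errors.
Of course, these remarks do not hold in every case; the tools may perform differently on different species and different data.
Among assemblers based on the overlap-layout approach, CABOG is the first choice; it derives from the assembler originally used for the Drosophila genome. In addition, the soon-to-be-released MSR-CA also looks promising: it reportedly combines the strengths of all the tools above.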
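Contiguity, as compared above, is commonly summarized with the N50 statistic: the length of the contig at which the cumulative length, taken from the longest contig downward, first reaches half of the total assembly length. The sketch below is a minimal version that takes a plain list of contig lengths; the function name and the example lengths are illustrative assumptions, and the source does not say which metric its comparison was based on.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig length")


if __name__ == "__main__":
    # Toy assembly: total length 300, so half is 150 (reached at 80 + 70).
    print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

N50 only measures contiguity; a misassembled but long contig raises it, so it is usually read alongside accuracy measures such as those discussed in the evaluation papers listed in 4.3 and 4.4.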
5. Genome annotation
Yandell, M. and D. Ence (2012). "A beginner's guide to eukaryotic genome annotation." Nat Rev Genet 13(5): 329-342.
6. Genome visualization
Nielsen, C. B., M. Cantor, et al. (2010). "Visualizing genomes: techniques and challenges." Nat Methods 7(3 Suppl): S5-S15.
7. Evolutionary analysis
Yang, Z. and B. Rannala (2012). "Molecular phylogenetics: principles and practice." Nat Rev Genet 13(5): 303-314.
8. Classic case studies
Colbourne, J. K., M. E. Pfrender, et al. (2011). "The ecoresponsive genome of Daphnia pulex." Science 331(6017): 555-561.
Kim, E. B., X. Fang, et al. (2011). "Genome sequencing reveals insights into physiology and longevity of the naked mole rat." Nature 479(7372): 223-227.
Grbic, M., T. Van Leeuwen, et al. (2011). "The genome of Tetranychus urticae reveals herbivorous pest adaptations." Nature 479(7374): 487-492.
The content above is reposted from: 測序中國 seq.cn (http://seq.cn/4607-48597)