Home
Jennifer Chang edited this page Aug 20, 2021
·
31 revisions
2012 FreeBayes
- Garrison, E. and Marth, G., 2012. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907.
2013 FALCON, FALCON-unzip, FALCON-Phase
- Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E.E. and Turner, S.W., 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature methods, 10(6), pp.563-569.
- Chin, C.S., Peluso, P., Sedlazeck, F.J., Nattestad, M., Concepcion, G.T., Clum, A., Dunn, C., O'Malley, R., Figueroa-Balderas, R., Morales-Cruz, A. and Cramer, G.R., 2016. Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods, 13(12), pp.1050-1054.
- Kronenberg, Z.N., Rhie, A., Koren, S., Concepcion, G.T., Peluso, P., Munson, K.M., Porubsky, D., Kuhn, K., Mueller, K.A., Low, W.Y. and Hiendleder, S., 2021. Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C. Nature communications, 12(1), pp.1-10.
- "Thus, we suggest the following genome assembly workflow: (1) partially phased long-read assembly, (2) FALCON-Phase on primary contigs and haplotigs, (3) scaffolding with Hi-C data, and (4) FALCON-Phase on scaffolds."
2015 Longranger
- Bishara, A., Liu, Y., Weng, Z., Kashef-Haghighi, D., Newburger, D.E., West, R., Sidow, A. and Batzoglou, S., 2015. Read clouds uncover variation in complex regions of the human genome. Genome research, 25(10), pp.1570-1580.
2016 minimap2
- Li, H., 2016. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), pp.2103-2110.
- Li, H., 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), pp.3094-3100.
2018 purge_haplotigs, purge_dups
- Roach, M.J., Schmidt, S.A. and Borneman, A.R., 2018. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics, 19(1), pp.1-10.
- purge_haplotigs
- Guan, D., McCarthy, S.A., Wood, J., Howe, K., Wang, Y. and Durbin, R., 2020. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 36(9), pp.2896-2898.
- C source code at https://github.com/dfguan/purge_dups
- Pipeline outline: (1) minimap2 (Li, 2018), (2) create windows by contigs and self-align, (3) remove haplotigs, (4) chain overlaps; something about the shorter contig (more detail in the Supplementary Material).
- "Following this [Scaff10x] with a round of polishing with Arrow closed a number of gaps, reducing contig number further and increasing contig N50" Wait... does Arrow merge contigs? Or maybe it is Scaff10x.
- "To our knowledge, scaffolders that use long-range information, such as Scaff10X with linked reads or SALSA with Hi-C data, do not handle heterozygous overlaps. We therefore recommend applying purge_dups directly after initial assembly, prior to scaffolding."
- "In conclusion, purge_dups can significantly improve genome assemblies by removing overlaps and haplotigs caused by sequence divergence in heterozygous regions." ... removes false dups, while retaining assembly completeness, improves scaffolding
- Supplemental
```bash
# === input/output variables
pfs=*.pfs              # raw Pacbio read alignment PAF files
asm=all_p_ctg.fasta    # primary assembly..um do I include mito and haplo here?

# === Purge dups commands
pbcstat $pfs           # will generate PB.base.cov and PB.stat
calcuts PB.stat > cutoffs 2> calcults.log
split_fa $asm > $asm.split.fa
minimap2 -xasm5 -DP $asm.split.fa $asm.split.fa > $asm.split.self.paf
purge_dups -2 -T cutoffs -c PB.base.cov $asm.split.self.paf > dups.bed 2> purge_dups.log
get_seqs dups.bed $asm > purged.fa 2> hap.fa    # so it separates here..haplotigs sent to stderr?
```
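Before running `get_seqs`, it can help to see how much sequence `dups.bed` flags in each category. The following is a minimal sketch, not part of purge_dups itself. It assumes `dups.bed` is tab-separated with contig, start, end, and a type label (for example `HAPLOTIG` or `OVLP`) in column 4, and that intervals follow the usual BED half-open convention. The function name `summarize_dups` and the demo records are hypothetical.

```shell
#!/usr/bin/env bash
# Sum the flagged bases per category in a purge_dups-style BED file.
# ASSUMPTION: tab-separated columns are contig, start, end, type
# (column 4 holds a label such as HAPLOTIG or OVLP); verify against your own dups.bed.
summarize_dups() {
    # Reads the files given as arguments, or stdin when none are given.
    # Prints one row per type: type, number of records, total bases (end - start).
    awk -F'\t' '
        NF >= 4 { bases[$4] += $3 - $2; n[$4]++ }
        END     { for (t in bases) printf "%s\t%d\t%d\n", t, n[t], bases[t] }
    ' "$@" | sort
}

# Demo on made-up records (hypothetical contig names and coordinates).
printf 'ctg1\t0\t1000\tHAPLOTIG\tctg2\nctg3\t500\t800\tOVLP\tctg4\nctg5\t0\t250\tHAPLOTIG\tctg6\n' |
    summarize_dups
```

On the demo records this prints a `HAPLOTIG` row (2 records, 1250 bases) and an `OVLP` row (1 record, 300 bases). On real output it would be called as `summarize_dups dups.bed`.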
2020 Merqury
- Rhie, A., Walenz, B.P., Koren, S. and Phillippy, A.M., 2020. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology, 21(1), pp.1-27.
2021 merfin, mitoVGP
- Formenti, G., Rhie, A., Walenz, B.P., Thibaud-Nissen, F., Shafin, K., Koren, S., Myers, E.W., Jarvis, E.D. and Phillippy, A.M., 2021. Merfin: improved variant filtering and polishing via k-mer validation. bioRxiv.
- Formenti, G., Rhie, A., Balacco, J., Haase, B., Mountcastle, J., Fedrigo, O., Brown, S., Capodiferro, M.R., Al-Ajli, F.O., Ambrosini, R. and Houde, P., 2021. Complete vertebrate mitogenomes reveal widespread repeats and gene duplications. Genome biology, 22(1), pp.1-22.
- Rhie, A., McCarthy, S.A., Fedrigo, O., Damas, J., Formenti, G., Koren, S., Uliano-Silva, M., Chow, W., Fungtammasan, A., Kim, J. and Lee, C., 2021. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592(7856), pp.737-746.
- "Genome heterozygosity posed additional problems, because homologous haplotypes in a diploid or polyploid genome are forced together into a single consensus by standard assemblers, sometimes creating false gene duplications."
- Website: https://vertebrategenomesproject.org
- "To our knowledge, this was the first systematic analysis of many sequence technologies, assembly algorithms, and assembly parameters applied on the same individual" heh, that would be fun
- "After fixing a function in the PacBio FALCON software that caused artificial breaks in contigs between stretches of highly homozygous and heterozygous haplotype sequences (Supplementary Note 1, Table 2), ..." did we fix this as well?
- VGP assembly pipeline (v1.0): haplotype-separated CLR contigs, scaffolding with linked reads, optical maps and Hi-C, gap filling, base call polishing, manual curation (Extended Data Figs. 2a (polishing after scaffolding), 3a).
- VGP assembly flowchart (Extended Data Fig. 3): purge dups -> scaffold -> polish {Arrow, longranger+FreeBayes, longranger+FreeBayes}. Does "with binned reads" mean reads binned by contig?
2021 ag100pest update
- Childers, A.K., Geib, S.M., Sim, S.B., Poelchau, M.F., Coates, B.S., Simmonds, T.J., Scully, E.D., Smith, T.P., Childers, C.P., Corpuz, R.L. and Hackett, K., 2021. The USDA-ARS Ag100Pest Initiative: High-Quality Genome Assemblies for Agricultural Pest Arthropod Research. Insects, 12(7), p.626.
- Figure 1: general workflow
- Bioproject: https://www.ncbi.nlm.nih.gov/bioproject/555319
- "Ag100Pest began by using continuous long reads (CLRs) for assembly (details not presented herein) as the improved HiFi procedure [33] had not yet been developed"