Jennifer Chang edited this page Aug 20, 2021 · 31 revisions

Timeline

2012 FreeBayes

2013 FALCON, FALCON-unzip, FALCON-Phase

2018 minimap2

2018 purge_haplotigs, purge_dups

  • Roach, M.J., Schmidt, S.A. and Borneman, A.R., 2018. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics, 19(1), pp.1-10.
    • purge_haplotigs
  • Guan, D., McCarthy, S.A., Wood, J., Howe, K., Wang, Y. and Durbin, R., 2020. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 36(9), pp.2896-2898.
    • C source code at https://github.com/dfguan/purge_dups
    • Pipeline outline: (1) minimap2 alignment (Li, 2018), (2) create windows by contig and self-align, (3) remove haplotigs, (4) chain overlaps (something about the shorter contig; more detail in the Supplementary Material).
    • "Following this [Scaff10x] with a round of polishing with Arrow closed a number of gaps, reducing contig number further and increasing contig N50" Wait, does Arrow merge contigs? Or maybe it's Scaff10x.
    • "To our knowledge, scaffolders that use long-range information, such as Scaff10X with linked reads or SALSA with Hi-C data, do not handle heterozygous overlaps. We therefore recommend applying purge_dups directly after initial assembly, prior to scaffolding."
    • "In conclusion, purge_dups can significantly improve genome assemblies by removing overlaps and haplotigs caused by sequence divergence in heterozygous regions." In short: it removes false duplications while retaining assembly completeness, and improves scaffolding.
    • Supplemental
    # === Input/output variables
    pfs=*.pfs                # raw PacBio read alignment PAF files
    asm=all_p_ctg.fasta      # primary assembly (open question: include mito and haplotigs here?)
    
    # === purge_dups commands
    pbcstat $pfs             # generates PB.base.cov and PB.stat
    calcuts PB.stat > cutoffs 2> calcuts.log
    split_fa $asm > $asm.split.fa
    minimap2 -xasm5 -DP $asm.split.fa $asm.split.fa > $asm.split.self.paf
    purge_dups -2 -T cutoffs -c PB.base.cov $asm.split.self.paf > dups.bed 2> purge_dups.log
    get_seqs dups.bed $asm > purged.fa 2> hap.fa   # purged assembly on stdout; haplotigs apparently go to stderr
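
To check how much sequence the purge step moved out of the primary assembly, the FASTA files can be compared by sequence count, total length and N50. The following is a minimal sketch, assuming plain (uncompressed) FASTA input; `fa_stats` is a hypothetical helper written for these notes, not part of purge_dups.

```shell
# fa_stats FILE
# Prints one tab-separated line: file name, number of sequences, total bp, N50.
# Hypothetical helper (assumption: uncompressed FASTA, possibly multi-line records).
fa_stats() {
    # Step 1: emit one length per FASTA record.
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (seen) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: N50 is the length at which the running sum reaches half the total.
    awk -v name="$1" '
        { lens[NR] = $1; total += $1 }
        END {
            n50 = 0; run = 0
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { n50 = lens[i]; break }
            }
            printf "%s\t%d\t%.0f\t%.0f\n", name, NR, total, n50
        }'
}
```

Possible usage once the commands above have run: `for f in "$asm" purged.fa hap.fa; do fa_stats "$f"; done`. If the purge worked as described, the total for purged.fa should drop relative to the input assembly, with the difference showing up in hap.fa.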
    

2020 Merqury

2021 merfin, mitoVGP

  • Formenti, G., Rhie, A., Walenz, B.P., Thibaud-Nissen, F., Shafin, K., Koren, S., Myers, E.W., Jarvis, E.D. and Phillippy, A.M., 2021. Merfin: improved variant filtering and polishing via k-mer validation. bioRxiv.
  • Formenti, G., Rhie, A., Balacco, J., Haase, B., Mountcastle, J., Fedrigo, O., Brown, S., Capodiferro, M.R., Al-Ajli, F.O., Ambrosini, R. and Houde, P., 2021. Complete vertebrate mitogenomes reveal widespread repeats and gene duplications. Genome Biology, 22(1), pp.1-22.
  • Rhie, A., McCarthy, S.A., Fedrigo, O., Damas, J., Formenti, G., Koren, S., Uliano-Silva, M., Chow, W., Fungtammasan, A., Kim, J. and Lee, C., 2021. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592(7856), pp.737-746.
    • "Genome heterozygosity posed additional problems, because homologous haplotypes in a diploid or polyploid genome are forced together into a single consensus by standard assemblers, sometimes creating false gene duplications."
    • Website: https://vertebrategenomesproject.org
    • "To our knowledge, this was the first systematic analysis of many sequence technologies, assembly algorithms, and assembly parameters applied on the same individual" heh, that would be fun
    • "After fixing a function in the PacBio FALCON software that caused artificial breaks in contigs between stretches of highly homozygous and heterozygous haplotype sequences (Supplementary Note 1, Table 2), ..." did we fix this as well?
    • VGP assembly pipeline (v1.0): haplotype-separated CLR contigs, scaffolding with linked reads, optical maps and Hi-C, gap filling, base call polishing, manual curation (Extended Data Figs 2a (polishing after scaffolding) and 3a).
    • VGP assembly flowchart (Extended Data Fig 3): purge dups -> scaffold -> polish {Arrow, longranger+FreeBayes, longranger+FreeBayes}. Does "with binned reads" mean reads binned by contig?

2021 ag100pest update

Online Videos
