Jennifer Chang edited this page Aug 22, 2021 · 31 revisions

Timeline

2012 FreeBayes

2013 FALCON, FALCON-unzip, FALCON-Phase

2015 Longranger

2016 minimap2

2018 purge_haplotigs, purge_dups

  • Roach, M.J., Schmidt, S.A. and Borneman, A.R., 2018. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC bioinformatics, 19(1), pp.1-10.
    • purge_haplotigs
  • Guan, D., McCarthy, S.A., Wood, J., Howe, K., Wang, Y. and Durbin, R., 2020. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 36(9), pp.2896-2898.
    • C source code at https://github.com/dfguan/purge_dups
    • Pipeline outline: (1) minimap2 (Li, 2016), (2) create windows by contigs and self-align, (3) remove haplotigs, (4) chain overlaps... something about the shorter contig (more detail in the Supplementary Material).
    • "Following this [Scaff10x] with a round of polishing with Arrow closed a number of gaps, reducing contig number further and increasing contig N50" Wait... arrow merges contigs? or maybe it's Scaff10x.
    • "To our knowledge, scaffolders that use long-range information, such as Scaff10X with linked reads or SALSA with Hi-C data, do not handle heterozygous overlaps. We therefore recommend applying purge_dups directly after initial assembly, prior to scaffolding."
    • "In conclusion, purge_dups can significantly improve genome assemblies by removing overlaps and haplotigs caused by sequence divergence in heterozygous regions." ... removes false dups, while retaining assembly completeness, improves scaffolding
    • Supplemental
    # === input/output variables
    pfs=*.paf                # raw PacBio read alignment PAF files
    asm=all_p_ctg.fasta      # primary assembly..um do I include mito and haplo here?
    
    # === Purge dups commands
    pbcstat $pfs       # will generate PB.base.cov and PB.stat
    calcuts PB.stat > cutoffs 2> calcults.log
    split_fa $asm > $asm.split.fa
    minimap2 -xasm5 -DP $asm.split.fa $asm.split.fa > $asm.split.self.paf
    purge_dups -2 -T cutoffs -c PB.base.cov $asm.split.self.paf > dups.bed 2> purge_dups.log
    get_seqs dups.bed $asm > purged.fa 2> hap.fa        # so it separates here..haplotigs sent to stderr?
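
The snippet above starts from existing read-alignment PAF files, so step (1) of the outline has already happened. A minimal sketch of that step, assuming PacBio CLR reads in FASTA/FASTQ files and minimap2's `map-pb` preset. The function name `align_reads`, the `DRY_RUN` switch and the output naming are made up for these notes, not taken from the paper:

```shell
# Sketch only: align each raw read file to the primary assembly so that
# pbcstat has PAF files to read. File names are placeholders.
asm=${asm:-all_p_ctg.fasta}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = actually run minimap2

align_reads() {
    for reads in "$@"; do
        # Output name: read file name minus its last extension, plus .paf
        out="$(basename "${reads%.*}").paf"
        if [ "$DRY_RUN" = 1 ]; then
            echo "minimap2 -x map-pb $asm $reads > $out"
        else
            # map-pb is minimap2's preset for PacBio CLR reads
            minimap2 -x map-pb "$asm" "$reads" > "$out"
        fi
    done
}
```

Usage would be something like `DRY_RUN=0 align_reads reads/*.fastq.gz`, followed by the `pbcstat` call above.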
    

2020 Merqury

2021 merfin, mitoVGP

  • Formenti, G., Rhie, A., Walenz, B.P., Thibaud-Nissen, F., Shafin, K., Koren, S., Myers, E.W., Jarvis, E.D. and Phillippy, A.M., 2021. Merfin: improved variant filtering and polishing via k-mer validation. bioRxiv.
  • Formenti, G., Rhie, A., Balacco, J., Haase, B., Mountcastle, J., Fedrigo, O., Brown, S., Capodiferro, M.R., Al-Ajli, F.O., Ambrosini, R. and Houde, P., 2021. Complete vertebrate mitogenomes reveal widespread repeats and gene duplications. Genome biology, 22(1), pp.1-22.
  • Rhie, A., McCarthy, S.A., Fedrigo, O., Damas, J., Formenti, G., Koren, S., Uliano-Silva, M., Chow, W., Fungtammasan, A., Kim, J. and Lee, C., 2021. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592(7856), pp.737-746.
    • "Genome heterozygosity posed additional problems, because homologous haplotypes in a diploid or polyploid genome are forced together into a single consensus by standard assemblers, sometimes creating false gene duplications."
    • Website: https://vertebrategenomesproject.org
    • "To our knowledge, this was the first systematic analysis of many sequence technologies, assembly algorithms, and assembly parameters applied on the same individual" heh, that would be fun
    • "After fixing a function in the PacBio FALCON software that caused artificial breaks in contigs between stretches of highly homozygous and heterozygous haplotype sequences (Supplementary Note 1, Table 2), ..." did we fix this as well?
    • VGP assembly pipeline (v1.0): haplotype-separated CLR contigs, scaffolding with linked reads, optical maps and Hi-C, gap filling, base call polishing, manual curation (extended data Figs 2a (polishing after scaffolding), 3a).
    • VGP assembly flowchart (Extended Data Fig 3): purge dups -> scaffold -> polish {Arrow, longranger+FreeBayes, longranger+FreeBayes}. Does "with binned reads" mean reads binned by contig?

    FALCON and FALCON-Unzip were run with default parameters, except for computing the overlaps. Raw read overlaps were computed with DALIGNER parameters -k14 -e0.75 -s100 -l2500 -h240 -w8 to better reflect the higher error rate in early PacBio sequel I and II. Pread (preassembled read) overlaps were computed with DALIGNER parameters -k24 -e.90 -s100 -l1000 -h600 intending to collapse haplotypes for the FALCON step to better unzip genomes with high heterozygosity rate. FALCON-Unzip outputs both a pseudo-haplotype and a set of alternate haplotigs that represent the secondary alleles. We refer to these outputs as the primary contig set (c1) and alternate contig set (c2).

    To reduce these false duplications, we ran Purge_Haplotigs [13], first during curation (VGP v1.0 pipeline) and then later after contig formation (VGP v1.5 pipeline). To do the former, Purge_Haplotigs was run on the primary contigs (c1), and identified haplotigs were mapped to the scaffolded primary assembly with MashMap2 [86] for removal. In the latter, identified haplotigs were moved from the primary contigs (c1) to the alternate haplotig set (p2). The remaining primary contigs were referred to as p1; p2 combined with c2 was referred to as q2. Later, in the VGP v1.6 pipeline, we replaced Purge_Haplotigs with Purge_Dups [14], a new program developed by several of the authors in response to Purge_Haplotigs not removing partial false duplication at contig boundaries. Purging also removes excessive low-coverage (junk) and high-coverage (repeats) contigs. To calculate the presence and overall success of purging false duplications, we used a k-mer approach (Supplementary Methods, Supplementary Fig. 6).
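
To keep the set names straight, here is a rough sketch of the v1.5/v1.6 bookkeeping described above. It assumes purge_dups-style outputs where `purged.fa` plays the role of p1 and `hap.fa` the role of p2; that mapping, all file names and the `make_q2` helper are placeholders for illustration, not something the paper spells out:

```shell
# q2 = p2 (haplotigs moved out of c1) + c2 (FALCON-Unzip alternate haplotigs)
make_q2() {
    p2=$1    # haplotigs purged from the primary contigs
    c2=$2    # alternate haplotigs from FALCON-Unzip
    q2=$3    # combined alternate set (output)
    cat "$p2" "$c2" > "$q2"
    # Print the number of sequences in the combined alternate set
    grep -c '^>' "$q2"
}
```

Usage: `make_q2 hap.fa c2.fasta q2.fasta` writes q2 and prints its sequence count.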

    To polish bases in both haplotypes with minimal alignment bias, we concatenated the alternate haplotig set (c2 in v1.0 or q2 in v1.5–1.6) to the scaffolded primary set (s3) and the assembled mitochondrial genome (mitoVGP in v1.6). We then performed another round of polishing with Arrow (smrtanalysis 5.1.0.26412) using PacBio CLR reads, aligning with pbalign --minAccuracy=0.75 --minLength=50 --minAnchorSize=12 --maxDivergence=30 --concordant --algorithm=blasr --algorithmOptions=--useQuality --maxHits=1 --hitPolicy=random --seed=1 and consensus polishing with variantCaller --skipUnrecognizedContigs haploid -x 5 -q 20 -X120 -v --algorithm=arrow. While this round of polishing resulted in higher QV for all genomes herein considered, we noticed that it was particularly sensitive to the coverage cutoff parameter (-x). This is because Arrow generates a de novo consensus from the mapped reads without explicitly considering the reference sequence. Later, we found that the second round of Arrow polishing sometimes reduced the QV accuracy for some species. Upon investigation, this issue was traced back to option -x 5, which requires at least 5 reads to call consensus. Such low minimum requirements can lead to uneven polishing in low coverage regions. To avoid this behaviour, we suggest to increase the -x close to the half sequence coverage (for example, 30× when 60× was used for assembly) and check QV before moving forward.
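
The -x advice boils down to simple arithmetic. A tiny sketch, assuming the mean sequencing coverage is already known as an integer; `half_coverage` is a hypothetical helper for these notes, and the example value of 60 is the one from the quoted text:

```shell
# Suggest a minimum-coverage value for Arrow's -x option:
# roughly half of the sequencing coverage used for assembly.
half_coverage() {
    # Integer division, so odd coverages round down (55 -> 27)
    echo $(( $1 / 2 ))
}

coverage=60
min_cov=$(half_coverage "$coverage")
echo "suggested variantCaller option: -x $min_cov"
```

The remaining variantCaller options would stay as quoted above; only -x changes, and QV should still be checked afterwards.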

2021 Ag100Pest update, gEVAL

  • Childers, A.K., Geib, S.M., Sim, S.B., Poelchau, M.F., Coates, B.S., Simmonds, T.J., Scully, E.D., Smith, T.P., Childers, C.P., Corpuz, R.L. and Hackett, K., 2021. The USDA-ARS Ag100Pest Initiative: High-Quality Genome Assemblies for Agricultural Pest Arthropod Research. Insects, 12(7), p.626.
    • Figure 1: general workflow
    • Bioproject: https://www.ncbi.nlm.nih.gov/bioproject/555319
    • "Ag100Pest began by using continuous long reads (CLRs) for assembly (details not presented herein) as the improved HiFi procedure [33] had not yet been developed"
  • Howe, K., Chow, W., Collins, J., Pelan, S., Pointon, D.L., Sims, Y., Torrance, J., Tracey, A. and Wood, J., 2021. Significantly improving the quality of genome assemblies through curation. Gigascience, 10(1), p.giaa153.
    • gEVAL is a browser-based tool for evaluating the quality of genome assemblies
    • "This is especially timely in the context of emerging projects that aim to assemble the genomes of very large numbers of species to highest quality possible, including the Vertebrate Genomes Project (VGP), the Darwin Tree of Life Project (DToL, darwintreeoflife.org), and the overarching Earth Biogenome Project [1, 10]."
    • "Before being loaded into gEVAL, all assemblies are run through a nextflow [38] pipeline that performs contamination detection and separation or removal as described in Table 1, combined with removal of trailing Ns [38]."
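
For the "removal of trailing Ns" part, a rough sketch of what such a clean-up could look like on a FASTA file. This is not the gEVAL nextflow pipeline, just an awk illustration; `strip_trailing_ns` is a made-up name, and the output is written with one sequence line per record:

```shell
# Strip runs of N/n from the end of every sequence in a FASTA file.
# Sequence lines are joined per record first, so Ns that merely sit at the
# end of a wrapped line in the middle of a sequence are left alone.
strip_trailing_ns() {
    awk '
        function emit() {
            if (name != "") {
                sub(/[Nn]+$/, "", seq)   # drop trailing Ns only
                print name
                print seq
            }
        }
        /^>/ { emit(); name = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}
```

Usage: `strip_trailing_ns assembly.fasta > assembly.trimmed.fasta`.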

Online Videos
