Christopher Dunn edited this page Apr 12, 2019 · 1 revision

Next steps:

  • Technical debt:
  1. The ipa-dovetail binary is an overgrown experimental piece of code which will be deprecated soon. I have already extracted most of its components into individual Python scripts which are run in the workflow (this turned out to be faster for prototyping), and there is one more piece of functionality that I need to make stand-alone before the binary is deprecated. That is the dovetail functionality, which extracts only dovetail overlaps, plus near-dovetail overlaps (with a small allowed overhang) which get augmented into dovetail overlaps (at the risk of adding a few indels to the final assembly). This shouldn't take long, and then the ipa-dovetail call can be removed.

  2. Pypeflow is used as a submodule but only because falcon_kit requires it and IPA uses the string graph and GFA modules of falcon_kit. We could either make a falcon_kit_core repo with all the non-Pypeflow code, or just copy the required scripts into IPA.

  3. There are some tests, but more are needed.

  4. Eventually, it would be great to implement these scripts and the string graph modules in C++. Filtering scripts themselves are easy, SG might require plenty of work to get it right.

  • Features:

  5. Plasmid salvaging, though not only for plasmids. This can actually be done elegantly, and it will salvage plasmids, mitochondria, or GC-biased regions where DNA shearing produced smaller fragments. The idea: (I) after L1 raw read overlapping is done, do a linear pass over the list of input reads. For each read that is shorter than the seed_read_cutoff_length, check whether it has any overlaps with the seed reads (any read above the threshold). If it has 0 overlaps with seed reads, add it to the list of salvaged reads. (II) Repeat overlapping between the salvaged reads and the entire DB, and merge the result with the rest of the overlaps.

  6. Graph-based polishing. This requires that raptor can read BAM as input. I can work on this while we test the contiguity, figure out the filters and test the parameters.

  7. Explore ideas for faster pread overlapping. Currently not a bottleneck for bacteria (error correction is still the bottleneck), but should be very useful for CCS assemblies in the future.
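
Step (I) of the salvaging idea in item 5 could be sketched roughly as below. This is only an illustration, not code from IPA: the function name, the dict of read lengths, and the representation of overlaps as pairs of read IDs are all assumptions made for the example, and so is the choice to treat a read exactly at the cutoff as a seed read. Step (II), re-running the overlapper on the salvaged reads against the entire DB and merging the results, is not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_salvage_candidates(
    read_lengths: Dict[str, int],
    overlaps: Iterable[Tuple[str, str]],
    seed_read_cutoff_length: int,
) -> List[str]:
    """Return short reads that have zero overlaps with any seed read.

    read_lengths: hypothetical mapping of read ID -> read length.
    overlaps: hypothetical iterable of (read_a, read_b) ID pairs from L1 overlapping.
    """
    # Assumption: a read at or above the cutoff counts as a seed read.
    seeds: Set[str] = {
        read_id
        for read_id, length in read_lengths.items()
        if length >= seed_read_cutoff_length
    }

    # Collect every read that overlaps at least one seed read.
    touches_seed: Set[str] = set()
    for read_a, read_b in overlaps:
        if read_a in seeds:
            touches_seed.add(read_b)
        if read_b in seeds:
            touches_seed.add(read_a)

    # Linear pass over the input reads: keep short reads with no seed overlaps.
    return [
        read_id
        for read_id, length in read_lengths.items()
        if length < seed_read_cutoff_length and read_id not in touches_seed
    ]
```

The pass is linear in the number of reads plus the number of overlaps, so it should add little on top of the overlapping step itself.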
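
For the dovetail functionality described in item 1, a minimal sketch of one possible stand-alone filter follows. It is not the ipa-dovetail logic: it swaps in the generic overhang heuristic commonly used in overlap-layout assemblers, and it assumes both reads are on the same strand with half-open coordinates. The Overlap record, its field names, and the max_overhang parameter are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Overlap:
    """Hypothetical overlap record: aligned [start, end) on reads A and B."""
    a_start: int
    a_end: int
    a_len: int
    b_start: int
    b_end: int
    b_len: int


def classify(ovl: Overlap, max_overhang: int) -> str:
    """Classify an overlap as 'internal', 'a_contained', 'b_contained' or 'dovetail'."""
    # Unaligned bases on each side of each read.
    a_left, a_right = ovl.a_start, ovl.a_len - ovl.a_end
    b_left, b_right = ovl.b_start, ovl.b_len - ovl.b_end

    # The overhang is the unaligned sequence that a true dovetail would not have.
    overhang = min(a_left, b_left) + min(a_right, b_right)
    if overhang > max_overhang:
        return "internal"
    if a_left <= b_left and a_right <= b_right:
        return "a_contained"
    if a_left >= b_left and a_right >= b_right:
        return "b_contained"
    return "dovetail"


def augment_to_dovetail(ovl: Overlap, max_overhang: int) -> Optional[Overlap]:
    """Return the overlap stretched to the read ends, or None if it is not (near-)dovetail.

    Stretching assumes the overhang aligns without indels, which is where
    a few indels could slip into the final assembly.
    """
    if classify(ovl, max_overhang) != "dovetail":
        return None
    left = min(ovl.a_start, ovl.b_start)
    right = min(ovl.a_len - ovl.a_end, ovl.b_len - ovl.b_end)
    return replace(
        ovl,
        a_start=ovl.a_start - left,
        b_start=ovl.b_start - left,
        a_end=ovl.a_end + right,
        b_end=ovl.b_end + right,
    )
```

Containments and internal matches are rejected here; whether the real filter should keep them for other stages is a separate decision.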
