Option to use Spades and multithreading for Bowtie2 and Spades #210

Merged
merged 3 commits into from
Feb 2, 2018

@andreyto andreyto commented Feb 1, 2018

Two main changes are proposed here:

  • Spades can optionally be used for the assembly. Fermi-lite is still the default. The justification is given below.
  • Bowtie2 and Spades will automatically be run with multiple threads whenever threads would otherwise sit idle.
    How and why is described in more detail in the code comments. In short: assume that we have started ariba run
    with 16 threads, and we have 20 clusters. At the end of the Pool.starmap call, we would have some single-thread calls
    still running, with other threads in the pool staying idle because there is nothing else to do. The same would happen if
    there are only, say, two clusters to begin with. The proposed change tracks the total number of remaining clusters
    through a shared counter, and adaptively increases the number of threads for Bowtie2 and Spades calls. At any moment, the
    sum of used threads is guaranteed never to exceed the total allocated thread count (16 in our example). It should never
    result in longer wall clock time than the original implementation, which always made single-threaded calls.
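
The thread budgeting described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the names `threads_for_call` and `RemainingClusters` are hypothetical, and the integer-division rule is an assumption that happens to satisfy the stated guarantee. Since the remaining-cluster count only ever decreases, no running call holds more than `total // remaining` threads, so the sum stays within the total.

```python
import multiprocessing


def threads_for_call(total_threads, remaining_clusters):
    """Threads to hand to one Bowtie2/Spades call right now.

    Hypothetical helper: split the total evenly over the clusters
    that are still unfinished, but never go below one thread.
    """
    remaining_clusters = max(1, remaining_clusters)
    return max(1, total_threads // remaining_clusters)


class RemainingClusters:
    """Lock-protected counter of clusters not yet finished (hypothetical)."""

    def __init__(self, n_clusters):
        self._value = multiprocessing.Value("i", n_clusters)

    def get(self):
        with self._value.get_lock():
            return self._value.value

    def cluster_finished(self):
        with self._value.get_lock():
            self._value.value -= 1


total = 16
counter = RemainingClusters(20)

# 20 clusters left, 16 threads: every call stays single-threaded.
print(threads_for_call(total, counter.get()))  # 1

# 17 clusters finish; the last 3 can use 5 threads each (15 <= 16).
for _ in range(17):
    counter.cluster_finished()
print(threads_for_call(total, counter.get()))  # 5
```

In a real `multiprocessing.Pool` the counter would also have to be made visible to the worker processes (for example through a pool initializer or a `Manager`); that wiring is left out of the sketch.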

Why Spades is sometimes useful

I have been using Ariba a bit off-label, for extracting consensus sequences for target genes in WGS datasets in microbial
surveillance studies. In my case there is not much interest in the variants reported by Ariba itself, because we instead look
in a separate step at the differences in the consensus sequences across hundreds of isolates, essentially in an MSA. I
like your approach of recruiting reads to multiple alternative references and doing local de novo assembly. We were able
to quickly extract various exotic truncated versions of the target genes that were otherwise difficult to handle with a
pure mapping-based approach. The default fermi-lite assembler worked fine with WGS data across many studies and genes.
We deployed the tool in our internal Galaxy instance.

Recently, I tried to push this line of Ariba use further and assemble an RSV amplicon. The data was from a PCR
amplification of a contiguous chunk that spanned the C-terminus of the G gene and all of the F gene, followed by Nextera library
construction and MiSeq 300x2 sequencing. RSV comes in two major subtypes, which are then classified further into genotypes,
with some genotypes having insertions of about 60 nt in the G gene. The ability to supply alternative references is quite
useful in this case: it allows us to split those samples where co-infection with A and B has occurred, and immediately
gives us the subtype assignment. The reads in that dataset had extremely skewed coverage depth (often 30,000x at the F end, down
to 200x at the G end). That was probably due in part to occasional incorrect primer binding, but large coverage
variations are generally typical for viral amplicon sequencing.

In this challenging dataset, fermi-lite just could not cope: it would often generate fragmented assemblies, even after
I performed digital normalization to even out the coverage depth of the input reads.

Spades, on the other hand, was able to assemble full-length amplicons (and separate amplicons in A and B mixtures)
directly from the input reads without digital normalization, provided I used the single-cell mode (spades.py --sc).

So, I have re-integrated Spades into Ariba as an optional alternative to fermi-lite. I am quite sure that there are
going to be other challenging use cases where using a full-blown assembler like
Spades will make a decisive difference in the output quality, at the expense of longer runtimes.

I have deviated in a few places from your original Spades-related code:

  • There is a new option to ariba run called --spades_mode that allows selecting specialized variants of Spades such
    as --rna or --sc. My code then picks reasonable values for the other Spades options based on the --spades_mode choice. I have
    renamed your --spades_other_options to --spades_options in order to reflect the fact that if this argument is
    provided by the Ariba user, it completely replaces the default Spades options generated from the --spades_mode choice.
  • From the Spades output, I use contigs rather than the scaffolds that your code was using. Spades scaffolds contain runs of N as
    spacers between the contigs, and my impression was that they would get into the final Ariba output and be treated like
    real sequence. I might be mistaken on this point, though.
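
The option handling described in the first bullet could look something like the sketch below. It is an illustration only, not the code in this PR: `build_spades_command`, the `"wgs"` mode name, and the per-mode defaults in `DEFAULT_OPTIONS` are assumptions made up for the example. The Spades flags used (`--sc`, `--rna`, `--only-assembler`, `--careful`, `-k`, `-t`, `-1`, `-2`, `-o`) are real spades.py options.

```python
import shlex

# Assumed mode names; "sc" and "rna" map to the Spades flags named above,
# "wgs" is a hypothetical name for plain Spades with no mode flag.
MODE_FLAGS = {"wgs": [], "sc": ["--sc"], "rna": ["--rna"]}

# Illustrative defaults only; the defaults the PR really generates
# are not listed in this conversation.
DEFAULT_OPTIONS = {"wgs": "--only-assembler", "sc": "--only-assembler", "rna": ""}


def build_spades_command(reads1, reads2, outdir, threads=1, mode="wgs", options=None):
    """Return the spades.py argument list for one cluster (hypothetical helper).

    If `options` is given, it completely replaces the mode's defaults,
    mirroring the --spades_options behaviour described above.
    """
    if mode not in MODE_FLAGS:
        raise ValueError("unknown Spades mode: " + mode)
    extra = DEFAULT_OPTIONS[mode] if options is None else options
    cmd = ["spades.py"] + MODE_FLAGS[mode]
    cmd += ["-t", str(threads), "-1", reads1, "-2", reads2, "-o", outdir]
    return cmd + shlex.split(extra)


# Defaults derived from the mode:
print(build_spades_command("r1.fq", "r2.fq", "out", threads=4, mode="sc"))
# User-supplied options replace the defaults entirely:
print(build_spades_command("r1.fq", "r2.fq", "out", options="--careful -k 21,33"))
```

After such a run, the contigs would be read from `contigs.fasta` in the output directory rather than from `scaffolds.fasta`, in line with the second bullet.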

Andrey Tovchigrechko added 3 commits February 1, 2018 13:17
Cleanup tmp dirs before copytree in case the test harness is repeated
in a directory after previous run has failed with unhandled exceptions.
…des and Bowtie2.

Spades improves upon default fermilight on challenging datasets with highly uneven coverage
(viral, amplicon, single-cell).
Multithreading for Spades and Bowtie2 subprocesses is adaptive - kicks in at the end of
a multiprocessing map run when idle threads were appearing otherwise (or in cases of overall
fewer clusters than total threads).
@martinghunt
Contributor

@andreyto this looks great! Thanks for the clear explanations and adding tests.

Early on in development, the main reason for switching from spades to fermi-lite was actually the quality of the assemblies more than the slower run time. Sometimes spades would introduce strange errors, like false contig joins. Having said that, spades is continually being developed and is no doubt better now, so it makes sense to have it as an option in ariba. We were also using GapFiller to remove the Ns, but yes, as you say, they would be a problem. Your approach of using the contigs is probably better.

Nice change to the multithreading as well :)

@martinghunt martinghunt merged commit 8478708 into sanger-pathogens:master Feb 2, 2018
@andreyto
Author

@martinghunt would you be interested in another PR where I implemented a "plugin assembler" interface? I have defined and coded a CLI that an externally provided script must implement in order to be plugged into Ariba as the internal assembler step. In other words, instead of using, say, Spades as implemented now, the user can use something else entirely. I see it as a useful feature for experimenting, as well as for rescuing those sequencing datasets that have lots of upstream problems. I only need to come up with a reasonably simple script for the test case. My currently implemented use case of a plugin script is a bit extreme in complexity, and relies on several programs from BBTools as well as the Pilon polisher, in addition to Spades.

@martinghunt
Contributor

@andreyto sounds great! I don't really have time to maintain the code, but pull requests like your last one are always welcome.
