Option to use Spades and multithreading for Bowtie2 and Spades #210

andreyto · 2018-02-01T20:24:26Z

Two main changes are proposed here:

1. Spades can optionally be used for the assembly. fermi-lite is still the default. The justification is given below.
2. Bowtie2 and Spades are automatically run with multiple threads whenever threads would otherwise sit idle.

How and why is described in more detail in the comments in the code. In short: assume that we have started ariba run
with 16 threads, and we have 20 clusters. Towards the end of the Pool.starmap call, a few single-threaded calls would
still be running, while the other threads in the pool sit idle because there is nothing else to do. The same would happen if
there were only, say, two clusters to begin with. The proposed change tracks the total number of remaining clusters
through a shared counter, and adaptively increases the number of threads for Bowtie2 and Spades calls. At any moment, the
sum of used threads is guaranteed never to exceed the total allocated thread count (16 in our example). The change should never
result in a longer wall-clock time than the original implementation, in which every call was single-threaded.
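
The allocation rule can be illustrated with a minimal sketch. It is not the code from this PR: the function names and the integer-division rule are assumptions chosen only to show why the total stays bounded, and a plain integer stands in for the shared counter.

```python
def threads_for_call(total_threads, remaining_clusters):
    """Threads a single Bowtie2/Spades call may use right now.

    Hypothetical rule: split the thread budget evenly over the clusters
    that are not finished yet, but never go below one thread.
    """
    return max(1, total_threads // max(1, remaining_clusters))


def peak_threads(total_threads, n_clusters):
    """Worst-case number of threads in use while n_clusters are processed.

    Clusters finish one at a time. At each step we assume every running
    worker has just re-read the counter, which is the most demanding case.
    """
    remaining = n_clusters  # stands in for the shared counter
    peak = 0
    while remaining > 0:
        # The pool has total_threads workers, so at most that many run at once.
        running = min(remaining, total_threads)
        in_use = running * threads_for_call(total_threads, remaining)
        peak = max(peak, in_use)
        remaining -= 1  # one cluster finished
    return peak


# The example from the description: 16 threads, 20 clusters.
print(threads_for_call(16, 20))  # every call is single-threaded at the start
print(threads_for_call(16, 2))   # the last two clusters share the budget
print(peak_threads(16, 20))      # never above the 16 allocated threads
```

In a real multiprocessing pool the counter would have to be shared between worker processes and decremented under a lock as each cluster finishes. A call that started earlier keeps the smaller thread count it was given, which is why the sum stays within the budget.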

Why Spades is sometimes useful

I have been using Ariba a bit off-label, for extracting consensus sequences of target genes from WGS datasets in microbial
surveillance studies. In my case there is not much interest in the variants reported by Ariba itself, because we instead look,
in a separate step, at the differences between the consensus sequences across hundreds of isolates, essentially in an MSA. I
like your approach of recruiting reads to multiple alternative references and doing local de novo assembly. We were able
to quickly extract various exotic truncated versions of the target genes that were otherwise difficult to handle with a
pure mapping-based approach. The default fermi-lite assembler worked fine with WGS data across many studies and genes.
We deployed the tool in our internal Galaxy instance.

Recently, I tried to push this line of Ariba use further and assemble an RSV amplicon. The data came from a PCR
amplification of a contiguous chunk that spanned the C-terminal part of the G gene and all of the F gene, followed by Nextera library
construction and MiSeq 300x2 sequencing. RSV comes in two major subtypes, which are then classified further into genotypes,
with some genotypes having insertions of about 60 nt in the G gene. The ability to supply alternative references is quite
useful in this case: it allows us to split those samples where co-infection with A and B has occurred, and immediately
gives us the subtype assignment. The reads in that dataset had extremely skewed coverage depth (often 30,000x at the F end, down
to 200x at the G end). That was probably due in part to occasional incorrect primer binding, but large coverage
variations are generally typical of viral amplicon sequencing.

On this challenging dataset, fermi-lite just could not cope: it would often generate fragmented assemblies, even after
I performed digital normalization to even out the coverage depth of the input reads.

Spades, on the other hand, was able to assemble full-length amplicons (and separate amplicons in A and B mixtures)
directly from the input reads without any digital normalization, provided I used the single-cell mode (spades.py --sc).

So, I have re-integrated Spades into Ariba as an optional alternative to fermi-lite. I am quite sure that there will
be other challenging use cases where using a full-blown assembler like
Spades makes a decisive difference in the output quality, at the expense of longer runtimes.

I have deviated in a few places from your original Spades-related code:

1. There is a new option to ariba run called --spades_mode that allows selecting specialized variants of Spades such
as --rna or --sc. My code then picks reasonable values for the other Spades options based on the --spades_mode choice. I have
renamed your --spades_other_options to --spades_options in order to reflect the fact that, if this argument is
provided by the Ariba user, it completely replaces the default Spades options generated from the --spades_mode choice.
2. From the Spades output, I use contigs rather than the scaffolds that your code was using. Spades scaffolds contain runs of N as
spacers between the contigs, and my impression was that they would get into the final Ariba output and be treated like
real sequence. I might be mistaken on this point, though.
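
The interplay between --spades_mode and --spades_options described above could look roughly like the sketch below. It is an illustration, not the code from this PR: the function name, the "wgs" mode name and the per-mode defaults are hypothetical, and it assumes that the mode flag is always passed while only the generated defaults are replaced. The Spades flags themselves (-1, -2, -o, -t, --sc, --rna, --only-assembler, -k) are real spades.py options.

```python
import shlex

# Assumed mapping from an Ariba-level mode name to the Spades flag selecting it.
MODE_FLAGS = {"wgs": [], "sc": ["--sc"], "rna": ["--rna"]}

# Hypothetical defaults per mode; the defaults chosen in the PR may differ.
DEFAULT_OPTIONS = {"wgs": ["--only-assembler"], "sc": ["--only-assembler"], "rna": []}


def build_spades_command(reads1, reads2, outdir, threads=1,
                         spades_mode="wgs", spades_options=None):
    """Return the Spades command line as a list of arguments."""
    if spades_mode not in MODE_FLAGS:
        raise ValueError("Unknown spades_mode: " + spades_mode)
    cmd = ["spades.py", "-1", reads1, "-2", reads2, "-o", outdir, "-t", str(threads)]
    cmd += MODE_FLAGS[spades_mode]
    if spades_options is not None:
        # User-supplied options completely replace the generated defaults.
        cmd += shlex.split(spades_options)
    else:
        cmd += DEFAULT_OPTIONS[spades_mode]
    return cmd


# Defaults derived from the mode:
print(build_spades_command("reads_1.fq", "reads_2.fq", "spades_out",
                           threads=4, spades_mode="sc"))
# User options replace those defaults:
print(build_spades_command("reads_1.fq", "reads_2.fq", "spades_out",
                           spades_mode="sc", spades_options="-k 21,33,55"))
```

Returning an argument list rather than a single string fits with running the subprocess without a shell.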

Cleanup tmp dirs before copytree in case the test harness is repeated in a directory after previous run has failed with unhandled exceptions.

…des and Bowtie2. Spades improves upon default fermilight on challenging datasets with highly uneven coverage (viral, amplicon, single-cell). Multithreading for Spades and Bowtie2 subprocesses is adaptive - kicks in at the end of a multiprocessing map run when idle threads were appearing otherwise (or in cases of overall fewer clusters than total threads).

martinghunt · 2018-02-02T09:33:25Z

@andreyto this looks great! Thanks for the clear explanations and adding tests.

Early on in development, the main reason for switching from spades to fermi-lite was actually the quality of the assemblies more than the slower run time. Sometimes spades would introduce strange errors, like false contig joins. Having said that, spades is continually being developed, is no doubt better now, and makes sense to have it as an option in ariba. We were also using GapFiller to remove the Ns, but yes as you say they would be a problem. Your approach of using the contigs is probably better.

Nice change to the multithreading as well :)

andreyto · 2018-02-12T21:09:36Z

@martinghunt would you be interested in another PR where I implemented a "plugin assembler" interface? I have defined and coded a CLI that an externally provided script must implement in order to be plugged into Ariba as the internal assembler step. In other words, instead of using, say, Spades as implemented now, the user can use something else entirely. I see it as a useful feature for experimenting, as well as for rescuing those sequencing datasets that have lots of upstream problems. I only need to come up with a reasonably simple script for the test case. My currently implemented use case of a plugin script is a bit extreme in complexity, and relies on several programs from BBTools as well as the Pilon polisher, in addition to Spades.

martinghunt · 2018-02-20T12:59:20Z

@andreyto sounds great! I don't really have time to maintain the code, but pull requests like your last one are always welcome.

Andrey Tovchigrechko added 3 commits February 1, 2018 13:17

Option to run subprocess w/o shell

601e86c

Remove reference to spades options in tests that are not using spades.

4fb0392

Cleanup tmp dirs before copytree in case the test harness is repeated in a directory after previous run has failed with unhandled exceptions.

andrewjpage requested a review from martinghunt February 1, 2018 20:46

martinghunt merged commit 8478708 into sanger-pathogens:master Feb 2, 2018

This was referenced Dec 6, 2018

Stopping! Signal received: 13 #238

Closed

Also getting Stopping! Signal received: 13 #249

Closed
