feat: reduce fgbio memory usage #296

FelixMoelder · 2024-04-11T09:46:58Z

For large samples fgbio AnnotateBamWithUmis fails due to memory issues.
This happens as fgbio loads the fully uncompressed UMI fastq file into memory.
We already tried to fix this in a previous PR (#262) by dynamically assigning more memory based on input file size.
Still this does not suffice in many cases.

To get this fixed fgbio suggests a best practice where raw reads are transformed into unmapped bam files while annotating UMIs at the same time.
In our case this is not feasible as reads need to be adapter trimmed by cutadapt first.
But in case of Qiagen reads the UMI lies in front of the adapter sequence (source) and therefore can not be annotated while transforming the reads into ubam.

Alternatively, fgbio AnnotateBamWithUmis allows to input a queryname sorted bam and fastq file reading the UMIs sequentially from the fastq file, first.
fastq files can be sorted by fgbio SortFastq. I intended to directly sort the mapped reads by queryname when running map_reads but it showed that queryname sorting differs between fgbio and samtools receiving incompatible files.
So, it is necessary to reorder the mapped reads by fgbio SortBam before annotating UMIs.

PS: Not sure if we treat this as fix or feature

In some cases sorting reads by `fgbio SortBam` is required as queryname sorting performed by `samtools` results in a different sorting order (see snakemake-workflows/dna-seq-varlociraptor#296). I therefore propose to add `fgbio-minimal` as an addition to the wrapper. ### QC  * [x] I confirm that: For all wrappers added by this PR, * there is a test case which covers any introduced changes, * `input:` and `output:` file paths in the resulting rule can be changed arbitrarily, * either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default, * rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads), * all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071), * the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a linux machine, * wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`), * all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.), * `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool, * temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works), * the `meta.yaml` contains a link to the documentation of the respective tool or command, * `Snakefile`s pass the linting (`snakemake --lint`), * `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt), * Python wrapper scripts are formatted with [black](https://black.readthedocs.io). * Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).

workflow/rules/mapping.smk

dlaehnemann · 2024-04-18T07:29:44Z

Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt?

dlaehnemann · 2024-04-18T07:34:56Z

And otherwise, I didn't fully understand the queryname sorting problem. Does fgbio AnnotateBamWithUmis not allow for using samtools sort -n files? What's the exact issue there?

And would samtools collate be an alternative, as this would be quicker:

https://www.htslib.org/doc/samtools-collate.html

FelixMoelder · 2024-04-18T11:21:20Z

And otherwise, I didn't fully understand the queryname sorting problem. Does fgbio AnnotateBamWithUmis not allow for using samtools sort -n files? What's the exact issue there?

And would samtools collate be an alternative, as this would be quicker:

https://www.htslib.org/doc/samtools-collate.html

I tried samtools sort -n first. But it turned out that the lexicographic order of samtools is not the same as the order of fgbio SortFastq and therefore annotation fails.

Also samtools collate will not help as the sorting between fastqs and bams must be identical.

FelixMoelder · 2024-04-18T11:29:04Z

Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt?

We might could do that but I do not see why it should be a best practice to transform fastq -> ubam -> fastq -> trimming.
Also we would need to write the annotated UMIs from the unmapped bam into the description fields of the fastqs again. And after mapping the umis need to be written to the bam record over again. All of this appears to me like a worst practice.

dlaehnemann · 2024-04-18T13:44:24Z

Oh, yeah. I totally agree that there doesn't seem to be an "elegant" solution here. But from looking at everything, I think the least redundant work to do, is if we simply do the best practices and only insert the cutadapt adapter trimming in step 1.2 (after samtools fastq, before bwa mem):
https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md#step-12-ubam---mapped-bam

We might have to make sure that cutadapt does not fully discard reads, even if their sequence is completely trimmed. And I think samtools fastq in this setup produces interleaved fastq, so we have to make sure cutadapt works properly in this scenario.

BTW, many thanks for taking care of this -- I know it's a lot of hassle!

FelixMoelder · 2024-04-19T07:33:56Z

I just spend a thought on how the "best practice" actually works.
What they do is to annotated the fastqs in the first step and output a ubam file.
This bam-file is then transformed to fastq once again and mapped against the reference.
The actual annotation is then performed by fgbio ZipperBams by inputing the annotated ubam and mapped bam file.
So, I do not see why the transformation of the ubam into a fastq is even necessary as one can just use the initial fastqs, map these and then perform annotation. In our case this would just be a separate step. I just did a quick drawing of the workflow (sorry for my handwriting; boxes are rules).

In that case we would just skip the ubam -> fastq step and still could trim the initial fastqs.

But, is this any better than the current implementation?
The best practice adds two steps:

annotating umis (by creating ubams)
annotating umis to mapped reads (using the annotated ubam)

Our implementation also adds two steps:

query sorting initial fastqs
annotating bam by initial fastq

If I should guess I would assume that the best practice is a bit faster (annotation ubam vs query sorting) and probably also required less disk space (bam vs untrimmed fastqs). Disk space can probably be ignored as the output is just temporary.

Edit:
I just realized that ZipperBams also requires a queryname sorting:

Both the unmapped and mapped BAMs must be a) queryname sorted or grouped (i.e. all records with the same name are grouped together in the file), and b) have the same ordering of querynames. If either of these are violated the output is undefined!

This means we have the same issue as for the current implementation and will end up with additional steps for the best practice which makes it a bad practice. ;)

FelixMoelder · 2024-04-19T08:51:57Z

I just sketched both options to get an impression what the workflow would look like if we consider that both options would require queryname sorted files.

Best Practice:

Alternative:

The only differences are that the best practice requires an additional step (transformation into ubam) but therefore we could use samtools for queryname sorting as we have only bam files for annotation.
The alternative does not require any file type transformation but as queryname sorting is not supported for fastq files by samtools we would need to do queryname sorting by fgbio for fastqs and during mapping.

dlaehnemann · 2024-04-19T15:52:28Z

Many thanks for all the thoughts and sketches. I think I see mainly two good options, and I mention a third and why I don't think it is a good idea... 😅

A) query sort fastq, integrate with AnnotateBamWithUmis

I think this option is actually closest to our current implementation, so would be minimally invasive with regard to the current workflow status. It's basically your last sketch, and I think the only necessary additions will be:

The option --sorted to AnnotateBamWithUmis.
An added fgbio SortFastq rule/step to get a query name sorted fastq.
An added fgbio SortBam rule/step to get a query name sorted bam.
An added samtools sort rule/step to get a coordinate sorted bam at the end (we might pipe fgbio AnnotateBamWithUmis output directly into samtools sort to avoid one round of writing to disc).

With this, we initially get two pathways starting from the raw fastq files:

fastq -> cutadapt -> bwa mem -> fgbio SortBam -> query sorted bam
fastq -> fgbio SortFastq -> query sorted fastq

Then, we integrate the query sorted fastq and bam with fgbio AnnotateBamWithUmis -> samtools sort -> coordinate sorted bam.

B) cutadapt as-is, followed by best practice

Could you double-check, that the UMIs get trimmed out by cutadapt in the Qiagen setup you quoted above? To me it really looks like the UMIs come before the Nextera Transposase 2 adapter, which I think should be the adapter to trim in this setup. So the UMIs should remain in the trimmed reads and then the regular best practices should work:

cutadapt -> trimmed fastqs
fgbio FastqToBam -> unmapped bam
samtools fastq | bwa mem -p | fgbio ZipperBams -> mapped bam
fgbio GroupReadsByUmi -> mapped bam with UMI read group annotations (do we need this for deduplication?)

C) cutadapt injected into best practices

I think this is a non-option, see reasons at the end. But I do want to document thinking this through... 😅

If adapter trimming with cutadapt does indeed remove the UMIs in the Qiagen setting, we could instead inject it into the pipe in step 3. above. For this, we would need the --interleaved flag of cutadapt:

fgbio FastqToBam -> unmapped bam
samtools fastq | cutadapt --interleaved | bwa mem -p | fgbio ZipperBams -> mapped bam
fgbio GroupReadsByUmi -> mapped bam with UMI read group annotations (do we need this for deduplication?)

However, this has several downsides:

Standard fastqc cannot be run on the trimmed fastq, as the trimmed fastq does not get saved. If we do opt for saving it, we generate even more IO.
The workflow wiring gets more complicated, because cutadapt gets run at different points in the workflow, depending on whether we need to annotate UMIs or not.

dlaehnemann · 2024-04-19T16:09:04Z

Looking at the changes, A) is pretty much what you ended up implementing. So let's stick with that solution for now, I'll review the current state.

FelixMoelder · 2024-04-22T06:00:07Z

I would also advocate for solution A.
Queryname sorting the bam via fgbio SortBam is not necessary as a distinct rule as I already updated the bwa mem snakemake-wrapper to support fgbio for sorting in addition to samtools and picard.
The only thing I probably should do is to rename the parameter function for fgbio AnnotateBamWithUmis and to replace the short sorting parameter by the long one for comprehensibility.

dlaehnemann

Just some minor naming things, and a suggestion to move the get_annotate_umis_params function inline to the rule for further readability improvements. I think the semantic helper functions can help make it readable in-line. And the snakemake pipe functionality might solve the piping issue you mentioned.

workflow/rules/common.smk

workflow/rules/mapping.smk

workflow/rules/common.smk

dlaehnemann

OK, this looks good to me, now. Thanks for trying all these different avenues, and let's hope this finally resolves this memory usage bottleneck for good!

In some cases sorting reads by `fgbio SortBam` is required as queryname sorting performed by `samtools` results in a different sorting order (see snakemake-workflows/dna-seq-varlociraptor#296). I therefore propose to add `fgbio-minimal` as an addition to the wrapper. ### QC  * [x] I confirm that: For all wrappers added by this PR, * there is a test case which covers any introduced changes, * `input:` and `output:` file paths in the resulting rule can be changed arbitrarily, * either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default, * rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads), * all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071), * the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a linux machine, * wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`), * all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.), * `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool, * temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works), * the `meta.yaml` contains a link to the documentation of the respective tool or command, * `Snakefile`s pass the linting (`snakemake --lint`), * `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt), * Python wrapper scripts are formatted with [black](https://black.readthedocs.io). * Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).

FelixMoelder added 3 commits April 11, 2024 11:23

feat: reduce fgbio memory usage

2e3a472

formatting

876d181

formatting

d50a713

FelixMoelder requested review from dlaehnemann and johanneskoester April 12, 2024 06:19

FelixMoelder marked this pull request as ready for review April 12, 2024 06:19

FelixMoelder mentioned this pull request Apr 12, 2024

feat: add fgbio sorting to bwa mem snakemake/snakemake-wrappers#2805

Merged

1 task

replace rule by updated wrapper

8b343fb

FelixMoelder commented Apr 12, 2024

View reviewed changes

workflow/rules/mapping.smk Show resolved Hide resolved

update wrapper

39c3b79

rename function

438053b

dlaehnemann reviewed Apr 22, 2024

View reviewed changes

workflow/rules/common.smk Outdated Show resolved Hide resolved

workflow/rules/mapping.smk Show resolved Hide resolved

workflow/rules/common.smk Outdated Show resolved Hide resolved

add piping

fd4a0e9

dlaehnemann approved these changes Apr 24, 2024

View reviewed changes

dlaehnemann merged commit 71611c8 into master Apr 24, 2024
7 checks passed

dlaehnemann deleted the feat/reduce_umi_memory branch April 24, 2024 13:14

github-actions bot mentioned this pull request Apr 24, 2024

chore(master): release 5.6.0 #298

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: reduce fgbio memory usage #296

feat: reduce fgbio memory usage #296

FelixMoelder commented Apr 11, 2024 •

edited

Loading

dlaehnemann commented Apr 18, 2024

dlaehnemann commented Apr 18, 2024

FelixMoelder commented Apr 18, 2024 •

edited

Loading

FelixMoelder commented Apr 18, 2024 •

edited

Loading

dlaehnemann commented Apr 18, 2024

FelixMoelder commented Apr 19, 2024 •

edited

Loading

FelixMoelder commented Apr 19, 2024

dlaehnemann commented Apr 19, 2024

dlaehnemann commented Apr 19, 2024

FelixMoelder commented Apr 22, 2024

dlaehnemann left a comment

dlaehnemann left a comment

feat: reduce fgbio memory usage #296

feat: reduce fgbio memory usage #296

Conversation

FelixMoelder commented Apr 11, 2024 • edited Loading

dlaehnemann commented Apr 18, 2024

dlaehnemann commented Apr 18, 2024

FelixMoelder commented Apr 18, 2024 • edited Loading

FelixMoelder commented Apr 18, 2024 • edited Loading

dlaehnemann commented Apr 18, 2024

FelixMoelder commented Apr 19, 2024 • edited Loading

FelixMoelder commented Apr 19, 2024

dlaehnemann commented Apr 19, 2024

A) query sort fastq, integrate with AnnotateBamWithUmis

B) cutadapt as-is, followed by best practice

C) cutadapt injected into best practices

dlaehnemann commented Apr 19, 2024

FelixMoelder commented Apr 22, 2024

dlaehnemann left a comment

Choose a reason for hiding this comment

dlaehnemann left a comment

Choose a reason for hiding this comment

FelixMoelder commented Apr 11, 2024 •

edited

Loading

FelixMoelder commented Apr 18, 2024 •

edited

Loading

FelixMoelder commented Apr 18, 2024 •

edited

Loading

FelixMoelder commented Apr 19, 2024 •

edited

Loading