Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: reduce fgbio memory usage #296

Merged
merged 7 commits into from
Apr 24, 2024
Merged

feat: reduce fgbio memory usage #296

merged 7 commits into from
Apr 24, 2024

Conversation

FelixMoelder
Copy link
Contributor

@FelixMoelder FelixMoelder commented Apr 11, 2024

For large samples fgbio AnnotateBamWithUmis fails due to memory issues.
This happens as fgbio loads the fully uncompressed UMI fastq file into memory.
We already tried to fix this in a previous PR (#262) by dynamically assigning more memory based on input file size.
Still this does not suffice in many cases.

To get this fixed fgbio suggests a best practice where raw reads are transformed into unmapped bam files while annotating UMIs at the same time.
In our case this is not feasible as reads need to be adapter trimmed by cutadapt first.
But in case of Qiagen reads the UMI lies in front of the adapter sequence (source) and therefore can not be annotated while transforming the reads into ubam.

Alternatively, fgbio AnnotateBamWithUmis allows to input a queryname sorted bam and fastq file reading the UMIs sequentially from the fastq file, first.
fastq files can be sorted by fgbio SortFastq. I intended to directly sort the mapped reads by queryname when running map_reads but it showed that queryname sorting differs between fgbio and samtools receiving incompatible files.
So, it is necessary to reorder the mapped reads by fgbio SortBam before annotating UMIs.

PS: Not sure if we treat this as fix or feature

@FelixMoelder FelixMoelder marked this pull request as ready for review April 12, 2024 06:19
fgvieira pushed a commit to snakemake/snakemake-wrappers that referenced this pull request Apr 12, 2024
<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

In some cases sorting reads by `fgbio SortBam` is required as queryname
sorting performed by `samtools` results in a different sorting order
(see
snakemake-workflows/dna-seq-varlociraptor#296).
I therefore propose to add `fgbio-minimal` as an addition to the
wrapper.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tools purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running
`snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended
ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as
conda-forge should have highest priority and defaults channels are
usually not needed because most packages are in conda-forge nowadays).
@dlaehnemann
Copy link
Contributor

Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt?

@dlaehnemann
Copy link
Contributor

And otherwise, I didn't fully understand the queryname sorting problem. Does fgbio AnnotateBamWithUmis not allow for using samtools sort -n files? What's the exact issue there?

And would samtools collate be an alternative, as this would be quicker:

https://www.htslib.org/doc/samtools-collate.html

@FelixMoelder
Copy link
Contributor Author

FelixMoelder commented Apr 18, 2024

And otherwise, I didn't fully understand the queryname sorting problem. Does fgbio AnnotateBamWithUmis not allow for using samtools sort -n files? What's the exact issue there?

And would samtools collate be an alternative, as this would be quicker:

https://www.htslib.org/doc/samtools-collate.html

I tried samtools sort -n first. But it turned out that the lexicographic order of samtools is not the same as the order of fgbio SortFastq and therefore annotation fails.

Also samtools collate will not help as the sorting between fastqs and bams must be identical.

@FelixMoelder
Copy link
Contributor Author

FelixMoelder commented Apr 18, 2024

Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt?

We might could do that but I do not see why it should be a best practice to transform fastq -> ubam -> fastq -> trimming.
Also we would need to write the annotated UMIs from the unmapped bam into the description fields of the fastqs again. And after mapping the umis need to be written to the bam record over again. All of this appears to me like a worst practice.

@dlaehnemann
Copy link
Contributor

Oh, yeah. I totally agree that there doesn't seem to be an "elegant" solution here. But from looking at everything, I think the least redundant work to do, is if we simply do the best practices and only insert the cutadapt adapter trimming in step 1.2 (after samtools fastq, before bwa mem):
https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md#step-12-ubam---mapped-bam

We might have to make sure that cutadapt does not fully discard reads, even if their sequence is completely trimmed. And I think samtools fastq in this setup produces interleaved fastq, so we have to make sure cutadapt works properly in this scenario.

BTW, many thanks for taking care of this -- I know it's a lot of hassle!

@FelixMoelder
Copy link
Contributor Author

FelixMoelder commented Apr 19, 2024

I just spend a thought on how the "best practice" actually works.
What they do is to annotated the fastqs in the first step and output a ubam file.
This bam-file is then transformed to fastq once again and mapped against the reference.
The actual annotation is then performed by fgbio ZipperBams by inputing the annotated ubam and mapped bam file.
So, I do not see why the transformation of the ubam into a fastq is even necessary as one can just use the initial fastqs, map these and then perform annotation. In our case this would just be a separate step. I just did a quick drawing of the workflow (sorry for my handwriting; boxes are rules).
IMG_0010

In that case we would just skip the ubam -> fastq step and still could trim the initial fastqs.

But, is this any better than the current implementation?
The best practice adds two steps:

  • annotating umis (by creating ubams)
  • annotating umis to mapped reads (using the annotated ubam)

Our implementation also adds two steps:

  • query sorting initial fastqs
  • annotating bam by initial fastq

If I should guess I would assume that the best practice is a bit faster (annotation ubam vs query sorting) and probably also required less disk space (bam vs untrimmed fastqs). Disk space can probably be ignored as the output is just temporary.

Edit:
I just realized that ZipperBams also requires a queryname sorting:

Both the unmapped and mapped BAMs must be a) queryname sorted or grouped (i.e. all records with the same name are grouped together in the file), and b) have the same ordering of querynames. If either of these are violated the output is undefined!

This means we have the same issue as for the current implementation and will end up with additional steps for the best practice which makes it a bad practice. ;)

@FelixMoelder
Copy link
Contributor Author

I just sketched both options to get an impression what the workflow would look like if we consider that both options would require queryname sorted files.

Best Practice:
Best_Practice

Alternative:
Alternative

The only differences are that the best practice requires an additional step (transformation into ubam) but therefore we could use samtools for queryname sorting as we have only bam files for annotation.
The alternative does not require any file type transformation but as queryname sorting is not supported for fastq files by samtools we would need to do queryname sorting by fgbio for fastqs and during mapping.

@dlaehnemann
Copy link
Contributor

Many thanks for all the thoughts and sketches. I think I see mainly two good options, and I mention a third and why I don't think it is a good idea... 😅

A) query sort fastq, integrate with AnnotateBamWithUmis

I think this option is actually closest to our current implementation, so would be minimally invasive with regard to the current workflow status. It's basically your last sketch, and I think the only necessary additions will be:

  1. The option --sorted to AnnotateBamWithUmis.
  2. An added fgbio SortFastq rule/step to get a query name sorted fastq.
  3. An added fgbio SortBam rule/step to get a query name sorted bam.
  4. An added samtools sort rule/step to get a coordinate sorted bam at the end (we might pipe fgbio AnnotateBamWithUmis output directly into samtools sort to avoid one round of writing to disc).

With this, we initially get two pathways starting from the raw fastq files:

  1. fastq -> cutadapt -> bwa mem -> fgbio SortBam -> query sorted bam
  2. fastq -> fgbio SortFastq -> query sorted fastq

Then, we integrate the query sorted fastq and bam with fgbio AnnotateBamWithUmis -> samtools sort -> coordinate sorted bam.

B) cutadapt as-is, followed by best practice

Could you double-check, that the UMIs get trimmed out by cutadapt in the Qiagen setup you quoted above? To me it really looks like the UMIs come before the Nextera Transposase 2 adapter, which I think should be the adapter to trim in this setup. So the UMIs should remain in the trimmed reads and then the regular best practices should work:

  1. cutadapt -> trimmed fastqs
  2. fgbio FastqToBam -> unmapped bam
  3. samtools fastq | bwa mem -p | fgbio ZipperBams -> mapped bam
  4. fgbio GroupReadsByUmi -> mapped bam with UMI read group annotations (do we need this for deduplication?)

C) cutadapt injected into best practices

I think this is a non-option, see reasons at the end. But I do want to document thinking this through... 😅

If adapter trimming with cutadapt does indeed remove the UMIs in the Qiagen setting, we could instead inject it into the pipe in step 3. above. For this, we would need the --interleaved flag of cutadapt:

  1. fgbio FastqToBam -> unmapped bam
  2. samtools fastq | cutadapt --interleaved | bwa mem -p | fgbio ZipperBams -> mapped bam
  3. fgbio GroupReadsByUmi -> mapped bam with UMI read group annotations (do we need this for deduplication?)

However, this has several downsides:

  1. Standard fastqc cannot be run on the trimmed fastq, as the trimmed fastq does not get saved. If we do opt for saving it, we generate even more IO.
  2. The workflow wiring gets more complicated, because cutadapt gets run at different points in the workflow, depending on whether we need to annotate UMIs or not.

@dlaehnemann
Copy link
Contributor

Looking at the changes, A) is pretty much what you ended up implementing. So let's stick with that solution for now, I'll review the current state.

@FelixMoelder
Copy link
Contributor Author

I would also advocate for solution A.
Queryname sorting the bam via fgbio SortBam is not necessary as a distinct rule as I already updated the bwa mem snakemake-wrapper to support fgbio for sorting in addition to samtools and picard.
The only thing I probably should do is to rename the parameter function for fgbio AnnotateBamWithUmis and to replace the short sorting parameter by the long one for comprehensibility.

Copy link
Contributor

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor naming things, and a suggestion to move the get_annotate_umis_params function inline to the rule for further readability improvements. I think the semantic helper functions can help make it readable in-line. And the snakemake pipe functionality might solve the piping issue you mentioned.

workflow/rules/common.smk Outdated Show resolved Hide resolved
workflow/rules/mapping.smk Show resolved Hide resolved
workflow/rules/common.smk Outdated Show resolved Hide resolved
Copy link
Contributor

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks good to me, now. Thanks for trying all these different avenues, and let's hope this finally resolves this memory usage bottleneck for good!

@dlaehnemann dlaehnemann merged commit 71611c8 into master Apr 24, 2024
7 checks passed
@dlaehnemann dlaehnemann deleted the feat/reduce_umi_memory branch April 24, 2024 13:14
votti pushed a commit to FroeseLab/snakemake-wrappers that referenced this pull request Jun 22, 2024
<!-- Ensure that the PR title follows conventional commit style (<type>:
<description>)-->
<!-- Possible types are here:
https://github.com/commitizen/conventional-commit-types/blob/master/index.json
-->

In some cases sorting reads by `fgbio SortBam` is required as queryname
sorting performed by `samtools` results in a different sorting order
(see
snakemake-workflows/dna-seq-varlociraptor#296).
I therefore propose to add `fgbio-minimal` as an addition to the
wrapper.

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tools purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running
`snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended
ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as
conda-forge should have highest priority and defaults channels are
usually not needed because most packages are in conda-forge nowadays).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants