feat: reduce fgbio memory usage #296
In some cases sorting reads by `fgbio SortBam` is required as queryname sorting performed by `samtools` results in a different sorting order (see snakemake-workflows/dna-seq-varlociraptor#296). I therefore propose to add `fgbio-minimal` as an addition to the wrapper.

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* the `environment.yaml` pinning has been updated by running `snakedeploy pin-conda-envs environment.yaml` on a linux machine,
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).
Can we not do the standard best practices and then do the adapter trimming once we have gone back to FASTQ files? So basically do the ubam loop before running cutadapt? |
And otherwise, I didn't fully understand the queryname sorting problem. Does And would |
I tried Also samtools collate will not help as the sorting between fastqs and bams must be identical. |
We could do that, but I do not see why it should be a best practice to transform fastq -> ubam -> fastq -> trimming. |
Oh, yeah. I totally agree that there doesn't seem to be an "elegant" solution here. But from looking at everything, I think the least redundant work to do, is if we simply do the best practices and only insert the cutadapt adapter trimming in step We might have to make sure that cutadapt does not fully discard reads, even if their sequence is completely trimmed. And I think BTW, many thanks for taking care of this -- I know it's a lot of hassle! |
I just sketched both options to get an impression of what the workflow would look like, considering that both options would require queryname sorted files. The only difference is that the best practice requires an additional step (transformation into ubam), but in return we could use samtools for queryname sorting, as we would only have bam files for annotation. |
Many thanks for all the thoughts and sketches. I think I see mainly two good options, and I mention a third and why I don't think it is a good idea... 😅

A) query sort fastq, integrate with AnnotateBamWithUmis

I think this option is actually closest to our current implementation, so would be minimally invasive with regard to the current workflow status. It's basically your last sketch, and I think the only necessary additions will be:
With this, we initially get two pathways starting from the raw fastq files:
Then, we integrate the query sorted fastq and bam with

B) cutadapt as-is, followed by best practice

Could you double-check, that the UMIs get trimmed out by cutadapt in the Qiagen setup you quoted above? To me it really looks like the UMIs come before the Nextera Transposase 2 adapter, which I think should be the adapter to trim in this setup. So the UMIs should remain in the trimmed reads and then the regular best practices should work:
C) cutadapt injected into best practices

I think this is a non-option, see reasons at the end. But I do want to document thinking this through... 😅 If adapter trimming with cutadapt does indeed remove the UMIs in the Qiagen setting, we could instead inject it into the pipe in step 3. above. For this, we would need the
However, this has several downsides:
|
Looking at the changes, A) is pretty much what you ended up implementing. So let's stick with that solution for now, I'll review the current state. |
I would also advocate for solution A. |
Just some minor naming things, and a suggestion to move the `get_annotate_umis_params` function inline to the rule for further readability improvements. I think the semantic helper functions can help make it readable in-line. And the snakemake pipe functionality might solve the piping issue you mentioned.
OK, this looks good to me, now. Thanks for trying all these different avenues, and let's hope this finally resolves this memory usage bottleneck for good!
For large samples, `fgbio AnnotateBamWithUmis` fails due to memory issues. This happens because fgbio loads the fully uncompressed UMI fastq file into memory. We already tried to fix this in a previous PR (#262) by dynamically assigning more memory based on input file size. Still, this does not suffice in many cases.

To fix this, fgbio suggests a best practice where raw reads are transformed into unmapped bam files while annotating UMIs at the same time. In our case this is not feasible, as reads need to be adapter trimmed by cutadapt first. But in the case of Qiagen reads, the UMI lies in front of the adapter sequence (source) and therefore cannot be annotated while transforming the reads into ubam.

Alternatively, `fgbio AnnotateBamWithUmis` accepts a queryname sorted bam and fastq file as input and then reads the UMIs sequentially from the fastq file. fastq files can be sorted by `fgbio SortFastq`. I intended to directly sort the mapped reads by queryname when running `map_reads`, but it turned out that queryname sorting differs between `fgbio` and `samtools`, resulting in incompatible files. So, it is necessary to reorder the mapped reads with `fgbio SortBam` before annotating UMIs.

PS: Not sure if we treat this as fix or feature
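To illustrate why sequential UMI annotation needs both files in exactly the same order, here is a minimal Python sketch. It is not fgbio's or samtools' actual code. It assumes that fgbio orders read names lexicographically and that `samtools sort -n` uses a natural ordering in which digit runs compare as numbers; `natural_key` is only a rough stand-in for the latter. The read names, UMIs, and the helper `annotate_sorted` are hypothetical, and the sketch further assumes that every read in the fastq also appears in the bam.

```python
import re
from typing import Iterable, Iterator, List, Tuple, Union


def natural_key(name: str) -> List[Tuple[int, Union[int, str]]]:
    """Sort key where digit runs compare as numbers.

    Assumption: a rough approximation of natural queryname ordering,
    not the exact comparison used by any real tool.
    """
    return [
        (0, int(digits)) if digits else (1, other)
        for digits, other in re.findall(r"([0-9]+)|([^0-9]+)", name)
    ]


def annotate_sorted(
    bam_names: Iterable[str], umi_records: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Pair each bam read name with its UMI in a single sequential pass.

    Only one fastq record is held at a time, so memory use does not
    grow with the size of the UMI fastq. This only works if both
    inputs share the same ordering.
    """
    umis = iter(umi_records)
    current = next(umis, None)
    for name in bam_names:
        # A bam can hold several records per read name (e.g. both mates),
        # so advance the fastq only when a new read name starts.
        if current is not None and current[0] != name:
            current = next(umis, None)
        if current is None or current[0] != name:
            raise ValueError(f"ordering mismatch at read {name!r}")
        yield name, current[1]


# Hypothetical read names and UMIs.
read_names = ["r2", "r10", "r1"]
umi_by_read = {"r1": "AAA", "r2": "CCC", "r10": "GGG"}

lex_order = sorted(read_names)                       # ['r1', 'r10', 'r2']
natural_order = sorted(read_names, key=natural_key)  # ['r1', 'r2', 'r10']

# UMI fastq sorted lexicographically, two bam records per read name.
umi_fastq = [(name, umi_by_read[name]) for name in lex_order]
bam_lex = [name for name in lex_order for _ in range(2)]
bam_nat = [name for name in natural_order for _ in range(2)]

# Same ordering on both sides: the sequential pass succeeds.
annotated = list(annotate_sorted(bam_lex, umi_fastq))

# Different orderings: the pass fails as soon as the names diverge.
try:
    list(annotate_sorted(bam_nat, umi_fastq))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

With matching orderings the pass holds a single fastq record at a time, whereas the naturally ordered bam diverges from the lexicographically sorted fastq at the second read name. Under the stated assumptions, this is the kind of incompatibility that re-sorting the mapped reads with the same tool avoids.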