
Possible changes to fastq input #52

Open
johnlees opened this issue Aug 14, 2023 · 8 comments
Labels: enhancement (New feature or request)

@johnlees
Member

Two possible additions

An equivalent to the -m option in ska1:

Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option.

We can currently filter on ambiguous bases, but not on the frequency of the minor allele.
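A minimal sketch of what a ska1-style `-m` filter could look like: given per-base counts for the middle base across matching split k-mers, mask the call when the minor alleles exceed a fraction of the total. The function name, A/C/G/T count ordering, and the masking behaviour are illustrative assumptions, not ska's actual implementation.

```rust
// Hypothetical `-m`-style filter: counts are for the middle base across all
// split k-mers sharing the same flanks, in A, C, G, T order. If the minor
// alleles together exceed `max_minor_frac` of the total, the call is masked.
fn filter_middle_base(counts: [u32; 4], max_minor_frac: f64) -> Option<char> {
    let total: u32 = counts.iter().sum();
    if total == 0 {
        return None;
    }
    let (major_idx, major) = counts
        .iter()
        .enumerate()
        .max_by_key(|&(_, c)| c)
        .unwrap();
    let minor = total - major;
    if (minor as f64) / (total as f64) > max_minor_frac {
        None // too much disagreement: likely sequencing error or a mixed site
    } else {
        Some(b"ACGT"[major_idx] as char)
    }
}

fn main() {
    // 1 minor vs 9 major -> 10% minor alleles, call kept
    assert_eq!(filter_middle_base([9, 1, 0, 0], 0.2), Some('A'));
    // 3 minor vs 7 major -> 30% minor alleles, call masked
    assert_eq!(filter_middle_base([7, 3, 0, 0], 0.2), None);
}
```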

Early stop/fast mode

This could probably be made to go much faster by:

  1. Stopping reading the input after n reads (set by e.g. 15x coverage)
  2. For 'online' or ref-based analyses, only including k-mers in the ref and turning off filters (may need minimiser based bins to improve caching).

For now, the first of these could be a useful addition.
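The arithmetic for the first option is simple; a sketch, assuming the genome size and read length are known or estimated up front (the function name and parameters are hypothetical):

```rust
// Back-of-envelope helper for the "stop after n reads" idea: given a target
// depth (e.g. 15x), an assumed genome size and read length, compute how many
// reads to ingest before stopping.
fn reads_for_coverage(target_depth: f64, genome_len: u64, read_len: u64) -> u64 {
    ((target_depth * genome_len as f64) / read_len as f64).ceil() as u64
}

fn main() {
    // 15x over a 4.4 Mb genome with 150 bp reads -> 440,000 reads
    assert_eq!(reads_for_coverage(15.0, 4_400_000, 150), 440_000);
}
```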

@johnlees johnlees added the enhancement New feature or request label Aug 14, 2023
@johnlees johnlees self-assigned this Aug 14, 2023
@johnlees
Member Author

Another thought, a fastcall mode. Give a ref (.fa or .skf) and call SNPs against it given either a coverage or CPU time threshold.

As there will be no error correction, it makes sense to have a counter for each of the four middle bases and then a simple model to generate the call (e.g. [1, 0, 0, 5] -> G; [1, 3, 4, 0] -> Y). This probably looks like the bcftools model; it feels multinomial to me. Or just keep the BBF to filter singletons (include the middle base in the hash).
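The counter-to-call step above could be sketched as follows: filter singleton counts, then emit the IUPAC code for whichever bases remain. This is the simple singleton-filter variant, not the bcftools-style multinomial model; the A/C/G/T count ordering and the `min_count` cutoff are assumptions for illustration.

```rust
// Hypothetical "fastcall" middle-base caller: drop bases below `min_count`
// (filtering singletons), then map the surviving set of bases to an IUPAC code.
fn iupac_call(counts: [u32; 4], min_count: u32) -> Option<char> {
    // Bitmask: A=1, C=2, G=4, T=8, matching the IUPAC table below.
    let mut mask = 0u8;
    for (i, &c) in counts.iter().enumerate() {
        if c >= min_count {
            mask |= 1 << i;
        }
    }
    match mask {
        0b0001 => Some('A'),
        0b0010 => Some('C'),
        0b0100 => Some('G'),
        0b1000 => Some('T'),
        0b0011 => Some('M'), // A/C
        0b0101 => Some('R'), // A/G
        0b1001 => Some('W'), // A/T
        0b0110 => Some('S'), // C/G
        0b1010 => Some('Y'), // C/T
        0b1100 => Some('K'), // G/T
        0b0111 => Some('V'), // A/C/G
        0b1011 => Some('H'), // A/C/T
        0b1101 => Some('D'), // A/G/T
        0b1110 => Some('B'), // C/G/T
        0b1111 => Some('N'),
        _ => None, // everything filtered out
    }
}

fn main() {
    // Singleton A is filtered; only T survives
    assert_eq!(iupac_call([1, 0, 0, 5], 2), Some('T'));
    // C and T both pass the cutoff -> Y
    assert_eq!(iupac_call([0, 3, 1, 4], 2), Some('Y'));
}
```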

@johnlees
Member Author

Important point from @rderelle from his work on fastlin: many fastq files in SRA come from BAMs, so are sorted and you cannot take the first n reads as a random subsample. Checking for subsequent reads having the same minimiser would probably catch this in most cases.
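The check for sorted input could look something like this: compute a minimiser per read and flag the file as sorted if many consecutive reads share one. Lexicographic minimisers and the thresholds here are illustrative assumptions (ska uses hashed k-mers in practice).

```rust
// Sketch of a sortedness check: if a large fraction of consecutive reads share
// a minimiser, the file is probably coordinate-sorted (e.g. converted from a BAM).
fn minimiser(seq: &[u8], k: usize) -> Option<&[u8]> {
    if seq.len() < k {
        return None;
    }
    seq.windows(k).min() // smallest k-mer lexicographically
}

fn looks_sorted(reads: &[&[u8]], k: usize, frac_cutoff: f64) -> bool {
    let mins: Vec<_> = reads.iter().filter_map(|r| minimiser(r, k)).collect();
    if mins.len() < 2 {
        return false;
    }
    // Count consecutive read pairs whose minimisers match
    let repeats = mins.windows(2).filter(|w| w[0] == w[1]).count();
    (repeats as f64) / ((mins.len() - 1) as f64) > frac_cutoff
}

fn main() {
    // Overlapping reads from the same locus all share the minimiser "AAC"
    let sorted: Vec<&[u8]> = vec![b"GGAACT", b"GAACTT", b"AACTTG"];
    assert!(looks_sorted(&sorted, 3, 0.5));
    // Unrelated reads have distinct minimisers
    let shuffled: Vec<&[u8]> = vec![b"GGAACT", b"TTTCGA", b"CCCGTA"];
    assert!(!looks_sorted(&shuffled, 3, 0.5));
}
```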

@rderelle

rderelle commented Aug 29, 2023

I might have a suggestion regarding the fast mode (codename: 'flyover'):

(1) parse the first fastq file and only extract k-mers from 1 read in every 20.
(2) at the end of this first pass, if the average coverage of the split k-mers is above the cutoff (e.g. 80x), then SKA can end the analysis ... the runtime has been roughly divided by 20.
(3) if not, SKA could calculate how many more twentieths are needed to reach the coverage cutoff (with 2 fastq files, SKA could reasonably assume both contain the same number of reads) -> a second pass extracting, for instance, k-mers from the 2nd, 3rd and 4th reads in every 20 if 15% more reads are needed (in this case the runtime has been roughly divided by 5)

This approach would be able to analyse all types of fastq files (sorted or not). However, it would only make sense if the algorithm used to parse the fastq files is ultra-fast, as it would require two successive passes in most cases (e.g. 'seq_io' as in fastlin).

Edit: the examples mentioned above are probably not realistic (samples with very high coverages) since this approach should only divide the runtimes by a factor 3-4 at the most (depending on the cutoff).
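The second-pass arithmetic from the steps above can be sketched as follows (function name and parameters are hypothetical; one extra "slice" means the 2nd read in every 20, two extra means the 2nd and 3rd, and so on):

```rust
// Two-pass "flyover" arithmetic: pass 1 samples 1 read in `stride` (e.g. 20).
// If its coverage estimate falls short of the cutoff, compute how many extra
// 1-in-`stride` slices pass 2 must take.
fn extra_strides_needed(pass1_coverage: f64, cutoff: f64, stride: u32) -> u32 {
    if pass1_coverage >= cutoff {
        return 0; // done after pass 1
    }
    // Each additional 1-in-`stride` slice adds roughly `pass1_coverage` more depth.
    let needed = (cutoff - pass1_coverage) / pass1_coverage;
    (needed.ceil() as u32).min(stride - 1) // can't take more slices than exist
}

fn main() {
    // 1/20 of reads gave 4x; an 80x cutoff needs all 19 remaining slices
    assert_eq!(extra_strides_needed(4.0, 80.0, 20), 19);
    // 1/20 gave 70x, cutoff 80x -> one extra slice
    assert_eq!(extra_strides_needed(70.0, 80.0, 20), 1);
    // Already above the cutoff after pass 1
    assert_eq!(extra_strides_needed(90.0, 80.0, 20), 0);
}
```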

@johnlees
Member Author

Thanks for the suggestion! Here's some profiling I did with fastq input (with the older countmin filter, which is a bit slower than the current implementation):
[image: profiling breakdown of fastq input]
The bloom filter takes around 75% of the time when reading a file in. File reading is about 8% of the time (mostly spent on decompression with .fq.gz input; when I've tested previously, decompressing ahead of time doesn't usually help because the disk access takes proportionally longer). The k-mer processing is around 7% of the time, and dictionary adds are minimal.

So the reading is actually very fast if we skip the parsing, probably around 10-15x faster than the current run, which might make this viable.

Finally, I think the best solution would be to check for sorted reads with minimisers: if the reads are unsorted you can just do one pass, but if they are sorted then subsample.

@johnlees
Member Author

johnlees commented Aug 30, 2023

Full profile if you are interested (open in firefox)
[attachment: flamegraph_cm_bb]

@rderelle
Copy link

The minimisers approach seems simpler to implement but, for the sake of the discussion (I'm not advocating against it), it might suffer from two caveats:
- you might need to introduce another cutoff for the comparison of the minimisers to decide whether reads are sorted or not (I'm not entirely sure about this point as I have never used minimisers myself)
- if SKA concludes that reads are sorted, it would not be able to uniformly subsample along the fastq files without knowing a priori how many reads there are.

May I ask what profiling program did you use to generate these profiles? :)

@johnlees
Member Author

May I ask what profiling program did you use to generate these profiles? :)

It's flamegraph, which is really nice

@johnlees
Copy link
Member Author

For mapping reads to a reference, using a binary fuse filter for the ref k-mers could be a good choice. See https://docs.rs/xorf/latest/xorf/
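A sketch of the pre-filter this enables: only keep read k-mers that are present in the reference, discarding everything else before any further processing. A `HashSet` stands in here for the probabilistic filter (e.g. xorf's `BinaryFuse8` over hashed k-mers), and plain byte slices stand in for ska's encoded k-mers; the function and variable names are illustrative.

```rust
use std::collections::HashSet;

// Ref-based pre-filter sketch: retain only read k-mers seen in the reference.
// A binary fuse filter would replace the HashSet for a much smaller memory
// footprint, at the cost of a small false-positive rate.
fn ref_filter<'a>(read: &'a [u8], ref_kmers: &HashSet<&[u8]>, k: usize) -> Vec<&'a [u8]> {
    read.windows(k)
        .filter(|kmer| ref_kmers.contains(*kmer))
        .collect()
}

fn main() {
    // Build the "filter" from all 4-mers of a toy reference
    let reference: &[u8] = b"ACGTACGT";
    let ref_kmers: HashSet<&[u8]> = reference.windows(4).collect();

    // Only the read k-mers that occur in the reference survive
    let hits = ref_filter(b"TTACGTTT", &ref_kmers, 4);
    assert_eq!(hits, vec![&b"TACG"[..], &b"ACGT"[..]]);
}
```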
