Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can sourmash be used to remove microbal contamination from long reads? #3291

Open
spoonbender76 opened this issue Aug 15, 2024 · 0 comments
Open

Comments

@spoonbender76
Copy link

Hi,

I recently discovered sourmash from a benchmark study (Portik et al. 2022) and tested it myself. It's very fast and memory-effiecient.

I tried to use sourmash to remove microbal contamination from long reads but failed. My data is a pacbio hifi WGS for de novo assembly (SRR28491883). Since the target species is an insect, microbes like symbionts could not be removed completely before sequencing, which is showed in SRA taxonomy analysis.

I used the commands from #3095 and #3252.

# sourmash 4.8.11 
# sourmash_plugin_branchwater 0.9.7

# Nm_hifi.csv file for sketch 
# Nm_hifi.fasta is converted by bam2fasta in pbtk
cat Nm_hifi.csv
name,genome_filename,protein_filename
Nm_hifi,Nm_hifi.fasta,

# manysketch with --singleton
sourmash scripts manysketch Nm_hifi.csv -o Nm_hifi.manysketch.k31.singleton.zip -p dna,k=31,scaled=1000,abund --singleton
sourmash scripts index -k 31 gtdb-rs214-k31.zip -o gtdb-rs214-k31.rocksdb
sourmash scripts fastmultigather -k 31 -o Nm_hifi.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv Nm_hifi.manysketch.k31.singleton.zip gtdb-rs214-k31.rocksdb

I expected it to be a large file(taxomony assignment for 3.4M reads). However, fastmultigather finished within a minute and there was only one line of result in the output csv file.
Nm_hifi.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv

# manysketch without --singleton
sourmash scripts manysketch Nm_hifi.csv -o Nm_hifi.manysketch.k31.zip -p dna,k=31,scaled=1000,abund
sourmash scripts fastgather -k 31 -o Nm_hifi.manysketch.k31.fastgather.gtdb-rs214.csv Nm_hifi.manysketch.k31.zip gtdb-rs214-k31.zip

Then I tried manysketch without --singleton and used fastgather the output file seems normal.
Nm_hifi.manysketch.k31.fastgather.gtdb-rs214.csv

I also tried genbank-2022.03 databases and with -k 51. The output for manysketch with --singletons and fastmultigather either only one line(for k31 genbank bacteria) or empty(for all k51 databases) but for manysketch without --singletons and fastgather the output seems normal.

I'm confused about what's going wrong here. Can sourmash perform taxonomy assignment for long reads? Or is this unexpected behavior due to some specific commands/parameters I'm using?

To quickly reproduce this issue, I randomly sampled 10k reads from Nm_hifi.fasta(Nm_hifi_sample10k.fasta.gz on google drive).

gzip -d Nm_hifi_sample10k.fasta.gz
echo -e "name,genome_filename,protein_filename\nNm_hifi_sample10k,Nm_hifi_sample10k.fasta," > Nm_hifi_sample10k.csv
sourmash scripts manysketch Nm_hifi_sample10k.csv -o Nm_hifi_sample10k.manysketch.k31.singleton.zip -p dna,k=31,scaled=1000,abund --singleton
sourmash scripts fastmultigather -k 31 -o Nm_hifi_sample10k.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv Nm_hifi_sample10k.manysketch.k31.singleton.zip gtdb-rs214-k31.rocksdb
sourmash scripts manysketch Nm_hifi_sample10k.csv -o Nm_hifi_sample10k.manysketch.k31.zip -p dna,k=31,scaled=1000,abund
sourmash scripts fastgather -k 31 -o Nm_hifi_sample10k.manysketch.k31.fastgather.gtdb-rs214.csv Nm_hifi_sample10k.manysketch.k31.zip gtdb-rs214-k31.zip

There are three lines of results in the fastgather output but the fastmultigather output is empty.
Nm_hifi_sample10k.manysketch.k31.fastgather.gtdb-rs214.csv
Nm_hifi_sample10k.manysketch.k31.singleton.fastmultigather.gtdb-rs214.csv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant