Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

easy-search with --cov-mode 1 -c 1 --greedy-best-hits drops exact matches #586

Open
milenkodespotovic opened this issue Aug 9, 2022 · 4 comments

Comments

@milenkodespotovic
Copy link

I've bumped into what appears to be a bug when using easy-search with the --greedy-best-hits flag and --cov-mode 1 -c 1, where hits containing targets that fully cover the query are being filtered out. I've outlined the behaviour in the minimal repro below:

Using file seq.fasta with content (the sequence is arbitrary):

>random_seq
NTLGEQNARIWSEREVGICLARKHPFLDFSVVKDMIATGLGTDYKALSPQ

and running easy-search with the above fasta as both the query and target:

mmseqs easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits produces an empty result.tsv

Running the above command without the --greedy-best-hits flag results in the expected result.tsv:

random_seq	random_seq	1.000	50	0	0	1	50	1	50	1.586E-38	108

My guess is that there is a bug in the summarizeresult command being run when --greedy-best-hits is passed in:
summarizeresult tmp/15248047069833069253/result tmp/15248047069833069253/result_best -a 1 --overlap 0 -c 1 --threads 8 --compressed 0 -v 3

But it's quite possible I'm missing something... thanks for your time!

Expected Behavior

A result.tsv with content

random_seq	random_seq	1.000	50	0	0	1	50	1	50	1.586E-38	108

Current Behavior

An empty result.tsv

Steps to Reproduce (for bugs)

echo ">random_seq\nNTLGEQNARIWSEREVGICLARKHPFLDFSVVKDMIATGLGTDYKALSPQ" > seq.fasta
mmseqs easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits

MMseqs Output (for bugs)

Create directory tmp
easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits 

MMseqs Version:                        	13.45111
Substitution matrix                    	nucl:nucleotide.out,aa:blosum62.out
Add backtrace                          	false
Alignment mode                         	3
Alignment mode                         	0
Allow wrapped scoring                  	false
E-value threshold                      	0.001
Seq. id. threshold                     	0
Min alignment length                   	0
Seq. id. mode                          	0
Alternative alignments                 	0
Coverage threshold                     	1
Coverage mode                          	1
Max sequence length                    	65535
Compositional bias                     	1
Max reject                             	2147483647
Max accept                             	2147483647
Include identical seq. id.             	false
Preload mode                           	0
Pseudo count a                         	1
Pseudo count b                         	1.5
Score bias                             	0
Realign hits                           	false
Realign score bias                     	-0.2
Realign max seqs                       	2147483647
Gap open cost                          	nucl:5,aa:11
Gap extension cost                     	nucl:2,aa:1
Zdrop                                  	40
Threads                                	8
Compressed                             	0
Verbosity                              	3
Seed substitution matrix               	nucl:nucleotide.out,aa:VTML80.out
Sensitivity                            	5.7
k-mer length                           	0
k-score                                	2147483647
Alphabet size                          	nucl:5,aa:21
Max results per query                  	300
Split database                         	0
Split mode                             	2
Split memory limit                     	0
Diagonal scoring                       	true
Exact k-mer matching                   	0
Mask residues                          	1
Mask lower case residues               	0
Minimum diagonal score                 	15
Spaced k-mers                          	1
Spaced k-mer pattern                   	
Local temporary path                   	
Rescore mode                           	0
Remove hits by seq. id. and coverage   	false
Sort results                           	0
Mask profile                           	1
Profile E-value threshold              	0.001
Global sequence weighting              	false
Allow deletions                        	false
Filter MSA                             	1
Maximum seq. id. threshold             	0.9
Minimum seq. id.                       	0
Minimum score per column               	-20
Minimum coverage                       	0
Select N most diverse seqs             	1000
Min codons in orf                      	30
Max codons in length                   	32734
Max orf gaps                           	2147483647
Contig start mode                      	2
Contig end mode                        	2
Orf start mode                         	1
Forward frames                         	1,2,3
Reverse frames                         	1,2,3
Translation table                      	1
Translate orf                          	0
Use all table starts                   	false
Offset of numeric ids                  	0
Create lookup                          	0
Add orf stop                           	false
Overlap between sequences              	0
Sequence split mode                    	1
Header split mode                      	0
Chain overlapping alignments           	0
Merge query                            	1
Search type                            	0
Search iterations                      	1
Start sensitivity                      	4
Search steps                           	1
Exhaustive search mode                 	false
Filter results during exhaustive search	0
Strand selection                       	1
LCA search mode                        	false
Disk space limit                       	0
MPI runner                             	
Force restart with latest tmp          	false
Remove temporary files                 	true
Alignment format                       	0
Format alignment output                	query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                        	false
Overlap threshold                      	0
Database type                          	0
Shuffle input database                 	true
Createdb mode                          	0
Write lookup file                      	0
Greedy best hits                       	true

Alignment backtraces will be computed, since they were requested by output format.
createdb seq.fasta tmp/15248047069833069253/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 

Converting sequences

Time for merging to query_h: 0h 0m 0s 9ms
Time for merging to query: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 38ms
createdb seq.fasta tmp/15248047069833069253/target --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 

Converting sequences

Time for merging to target_h: 0h 0m 0s 9ms
Time for merging to target: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 40ms
Create directory tmp/15248047069833069253/search_tmp
search tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/result tmp/15248047069833069253/search_tmp -a 1 --alignment-mode 3 -c 1 --cov-mode 1 -s 5.7 --remove-tmp-files 1 

prefilter tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 1 --cov-mode 1 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 8 --compressed 0 -v 3 -s 5.7 

Query database size: 1 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 1 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 100.00% 1 eta -
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 1 eta -
Index statistics
Entries:          41
DB size:          488 MB
Avg k-mer size:   0.000001
Top 10 k-mers
    DFVKIA	1
    ATLTKA	1
    SEEGLA	1
    IWEEIC	1
    PFDSKD	1
    CLRHLD	1
    DMAGTD	1
    QNRWRE	1
    GENRSE	1
    LAKPDF	1
Time for index table init: 0h 0m 0s 778ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 1
[=================================================================] 100.00% 1 eta -

602.620000 k-mers per position
41 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
1 sequences passed prefiltering per query sequence
1 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 740ms
align tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 tmp/15248047069833069253/result --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 1 --cov-mode 1 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 8 --compressed 0 -v 3 

Compute score, coverage and sequence identity
Query database size: 1 type: Aminoacid
Target database size: 1 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 1 eta -
Time for merging to result: 0h 0m 0s 0ms
1 alignments calculated
1 sequence pairs passed the thresholds (1.000000 of overall calculated)
1.000000 hits per query sequence
Time for processing: 0h 0m 0s 28ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/aln_0 -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/input_0 -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/aln_merge -v 3 

Time for processing: 0h 0m 0s 1ms
summarizeresult tmp/15248047069833069253/result tmp/15248047069833069253/result_best -a 1 --overlap 0 -c 1 --threads 8 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to result_best: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 17ms
convertalis tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/result_best result.tsv --sub-mat nucl:nucleotide.out,aa:blosum62.out --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 8 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to result.tsv: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 11ms
rmdb tmp/15248047069833069253/result_best -v 3 

Time for processing: 0h 0m 0s 2ms
rmdb tmp/15248047069833069253/result -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/target -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/target_h -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/query -v 3 

Time for processing: 0h 0m 0s 2ms
rmdb tmp/15248047069833069253/query_h -v 3 

Context

See above.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): 13.45111
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
  • Operating system and version:
@milot-mirdita
Copy link
Member

Thanks a lot! Makes sense, hits that match the coverage threshold should not be rejected.

@milot-mirdita
Copy link
Member

You can download the precompiled static binary that contains the fix (and latest git changes) at https://mmseqs.com/latest in about an hour when all tests pass through successfully.

@milenkodespotovic
Copy link
Author

Thank you so much for the incredibly fast fix!

@milenkodespotovic
Copy link
Author

@milot-mirdita out of curiosity, any plans for a release in the next while?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants