`easy-search` with `--cov-mode 1 -c 1 --greedy-best-hits` drops exact matches #586

milenkodespotovic · 2022-08-09T05:13:02Z

I've bumped into what appears to be a bug when using easy-search with the --greedy-best-hits flag and --cov-mode 1 -c 1, where hits containing targets that fully cover the query are being filtered out. I've outlined the behaviour in the minimal repro below:

Using file seq.fasta with content (the sequence is arbitrary):

>random_seq
NTLGEQNARIWSEREVGICLARKHPFLDFSVVKDMIATGLGTDYKALSPQ

and running easy-search with the above fasta as both the query and target:

mmseqs easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits produces an empty result.tsv

Running the above command without the --greedy-best-hits flag results in the expected result.tsv:

random_seq	random_seq	1.000	50	0	0	1	50	1	50	1.586E-38	108

My guess is that there is a bug in the summarizeresult command being run when --greedy-best-hits is passed in:
summarizeresult tmp/15248047069833069253/result tmp/15248047069833069253/result_best -a 1 --overlap 0 -c 1 --threads 8 --compressed 0 -v 3

But it's quite possible I'm missing something... thanks for your time!

Expected Behavior

A result.tsv with content

random_seq	random_seq	1.000	50	0	0	1	50	1	50	1.586E-38	108

Current Behavior

An empty result.tsv

Steps to Reproduce (for bugs)

echo ">random_seq\nNTLGEQNARIWSEREVGICLARKHPFLDFSVVKDMIATGLGTDYKALSPQ" > seq.fasta
mmseqs easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits

MMseqs Output (for bugs)

Create directory tmp
easy-search seq.fasta seq.fasta result.tsv tmp --cov-mode 1 -c 1 --greedy-best-hits 

MMseqs Version:                        	13.45111
Substitution matrix                    	nucl:nucleotide.out,aa:blosum62.out
Add backtrace                          	false
Alignment mode                         	3
Alignment mode                         	0
Allow wrapped scoring                  	false
E-value threshold                      	0.001
Seq. id. threshold                     	0
Min alignment length                   	0
Seq. id. mode                          	0
Alternative alignments                 	0
Coverage threshold                     	1
Coverage mode                          	1
Max sequence length                    	65535
Compositional bias                     	1
Max reject                             	2147483647
Max accept                             	2147483647
Include identical seq. id.             	false
Preload mode                           	0
Pseudo count a                         	1
Pseudo count b                         	1.5
Score bias                             	0
Realign hits                           	false
Realign score bias                     	-0.2
Realign max seqs                       	2147483647
Gap open cost                          	nucl:5,aa:11
Gap extension cost                     	nucl:2,aa:1
Zdrop                                  	40
Threads                                	8
Compressed                             	0
Verbosity                              	3
Seed substitution matrix               	nucl:nucleotide.out,aa:VTML80.out
Sensitivity                            	5.7
k-mer length                           	0
k-score                                	2147483647
Alphabet size                          	nucl:5,aa:21
Max results per query                  	300
Split database                         	0
Split mode                             	2
Split memory limit                     	0
Diagonal scoring                       	true
Exact k-mer matching                   	0
Mask residues                          	1
Mask lower case residues               	0
Minimum diagonal score                 	15
Spaced k-mers                          	1
Spaced k-mer pattern                   	
Local temporary path                   	
Rescore mode                           	0
Remove hits by seq. id. and coverage   	false
Sort results                           	0
Mask profile                           	1
Profile E-value threshold              	0.001
Global sequence weighting              	false
Allow deletions                        	false
Filter MSA                             	1
Maximum seq. id. threshold             	0.9
Minimum seq. id.                       	0
Minimum score per column               	-20
Minimum coverage                       	0
Select N most diverse seqs             	1000
Min codons in orf                      	30
Max codons in length                   	32734
Max orf gaps                           	2147483647
Contig start mode                      	2
Contig end mode                        	2
Orf start mode                         	1
Forward frames                         	1,2,3
Reverse frames                         	1,2,3
Translation table                      	1
Translate orf                          	0
Use all table starts                   	false
Offset of numeric ids                  	0
Create lookup                          	0
Add orf stop                           	false
Overlap between sequences              	0
Sequence split mode                    	1
Header split mode                      	0
Chain overlapping alignments           	0
Merge query                            	1
Search type                            	0
Search iterations                      	1
Start sensitivity                      	4
Search steps                           	1
Exhaustive search mode                 	false
Filter results during exhaustive search	0
Strand selection                       	1
LCA search mode                        	false
Disk space limit                       	0
MPI runner                             	
Force restart with latest tmp          	false
Remove temporary files                 	true
Alignment format                       	0
Format alignment output                	query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                        	false
Overlap threshold                      	0
Database type                          	0
Shuffle input database                 	true
Createdb mode                          	0
Write lookup file                      	0
Greedy best hits                       	true

Alignment backtraces will be computed, since they were requested by output format.
createdb seq.fasta tmp/15248047069833069253/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 

Converting sequences

Time for merging to query_h: 0h 0m 0s 9ms
Time for merging to query: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 38ms
createdb seq.fasta tmp/15248047069833069253/target --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 

Converting sequences

Time for merging to target_h: 0h 0m 0s 9ms
Time for merging to target: 0h 0m 0s 8ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 40ms
Create directory tmp/15248047069833069253/search_tmp
search tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/result tmp/15248047069833069253/search_tmp -a 1 --alignment-mode 3 -c 1 --cov-mode 1 -s 5.7 --remove-tmp-files 1 

prefilter tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 1 --cov-mode 1 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 8 --compressed 0 -v 3 -s 5.7 

Query database size: 1 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 1 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 100.00% 1 eta -
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 1 eta -
Index statistics
Entries:          41
DB size:          488 MB
Avg k-mer size:   0.000001
Top 10 k-mers
    DFVKIA	1
    ATLTKA	1
    SEEGLA	1
    IWEEIC	1
    PFDSKD	1
    CLRHLD	1
    DMAGTD	1
    QNRWRE	1
    GENRSE	1
    LAKPDF	1
Time for index table init: 0h 0m 0s 778ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 112
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 1
[=================================================================] 100.00% 1 eta -

602.620000 k-mers per position
41 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
1 sequences passed prefiltering per query sequence
1 median result list length
0 sequences with 0 size result lists
Time for merging to pref_0: 0h 0m 0s 0ms
Time for processing: 0h 0m 1s 740ms
align tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 tmp/15248047069833069253/result --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 1 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 1 --cov-mode 1 --max-seq-len 65535 --comp-bias-corr 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 8 --compressed 0 -v 3 

Compute score, coverage and sequence identity
Query database size: 1 type: Aminoacid
Target database size: 1 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 1 eta -
Time for merging to result: 0h 0m 0s 0ms
1 alignments calculated
1 sequence pairs passed the thresholds (1.000000 of overall calculated)
1.000000 hits per query sequence
Time for processing: 0h 0m 0s 28ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/pref_0 -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/aln_0 -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/input_0 -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/search_tmp/6923777973734096903/aln_merge -v 3 

Time for processing: 0h 0m 0s 1ms
summarizeresult tmp/15248047069833069253/result tmp/15248047069833069253/result_best -a 1 --overlap 0 -c 1 --threads 8 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to result_best: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 17ms
convertalis tmp/15248047069833069253/query tmp/15248047069833069253/target tmp/15248047069833069253/result_best result.tsv --sub-mat nucl:nucleotide.out,aa:blosum62.out --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 8 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to result.tsv: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 11ms
rmdb tmp/15248047069833069253/result_best -v 3 

Time for processing: 0h 0m 0s 2ms
rmdb tmp/15248047069833069253/result -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/target -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/target_h -v 3 

Time for processing: 0h 0m 0s 1ms
rmdb tmp/15248047069833069253/query -v 3 

Time for processing: 0h 0m 0s 2ms
rmdb tmp/15248047069833069253/query_h -v 3

Context

See above.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): 13.45111
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
Operating system and version:

The text was updated successfully, but these errors were encountered:

milot-mirdita · 2022-08-09T06:20:40Z

Thanks a lot! Makes sense, hits that match the coverage threshold should not be rejected.

milot-mirdita · 2022-08-09T06:23:43Z

You can download the precompiled static binary that contains the fix (and latest git changes) at https://mmseqs.com/latest in about an hour when all tests pass through successfully.

milenkodespotovic · 2022-08-09T15:53:58Z

Thank you so much for the incredibly fast fix!

milenkodespotovic · 2022-08-09T17:37:18Z

@milot-mirdita out of curiosity, any plans for a release in the next while?

milot-mirdita closed this as completed in 67949d7 Aug 9, 2022

milot-mirdita reopened this Aug 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`easy-search` with `--cov-mode 1 -c 1 --greedy-best-hits` drops exact matches #586

`easy-search` with `--cov-mode 1 -c 1 --greedy-best-hits` drops exact matches #586

milenkodespotovic commented Aug 9, 2022

milot-mirdita commented Aug 9, 2022

milot-mirdita commented Aug 9, 2022

milenkodespotovic commented Aug 9, 2022

milenkodespotovic commented Aug 9, 2022

easy-search with --cov-mode 1 -c 1 --greedy-best-hits drops exact matches #586

easy-search with --cov-mode 1 -c 1 --greedy-best-hits drops exact matches #586

Comments

milenkodespotovic commented Aug 9, 2022

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

MMseqs Output (for bugs)

Context

Your Environment

milot-mirdita commented Aug 9, 2022

milot-mirdita commented Aug 9, 2022

milenkodespotovic commented Aug 9, 2022

milenkodespotovic commented Aug 9, 2022

`easy-search` with `--cov-mode 1 -c 1 --greedy-best-hits` drops exact matches #586

`easy-search` with `--cov-mode 1 -c 1 --greedy-best-hits` drops exact matches #586