Error occurring when ncfp encounters sequences which have been removed from NCBI database #34

ArtingBioinfo · 2022-07-13T16:02:46Z

Summary: NCBI sequence identifiers which have been removed from the NCBI database cause a "No link/record returned for: xyz" error.

Description:

I am using the ncfp command to retrieve the nucleotide sequences corresponding to a set of amino acid sequences. The input is a .fasta file in which each record contains the amino acid sequence and its associated accession number. When the command attempts to obtain the nucleotide sequences, two errors are reported. The first is "IndexError: list index out of range" and the second is "NCFPEFetchException: No link/record returned for xyz".

Reproducible Steps:

The command that produced these errors is:
ncfp SigR_500_aa_seqs.fasta \
    ncfp_nucleo_seqs \
    leealexanderkeenan@gmail.com

Current Output:

Process input sequences: 100%|██████████████████████████████████████████████████████████████████████| 500/500 [04:14<00:00, 1.96it/s]
Search NT IDs: 1%|▍ | 3/500 [00:05<16:09, 1.95s/it]
Traceback (most recent call last):
  File "/home/lee/miniconda3/lib/python3.9/site-packages/ncbi_cds_from_protein/entrez.py", line 216, in search_nt_ids
    idlist = [lid["Id"] for lid in result[0]["LinkSetDb"][0]["Link"]]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lee/miniconda3/bin/ncfp", line 10, in <module>
    sys.exit(run_main())
  File "/home/lee/miniconda3/lib/python3.9/site-packages/ncbi_cds_from_protein/scripts/ncfp.py", line 246, in run_main
    addedrows, countfail = search_nt_ids(qrecords, cachepath, args.retries, disabletqdm=args.disabletqdm)
  File "/home/lee/miniconda3/lib/python3.9/site-packages/ncbi_cds_from_protein/entrez.py", line 218, in search_nt_ids
    raise NCFPEFetchException("No link/record returned for %s" % record.id)
ncbi_cds_from_protein.entrez.NCFPEFetchException: No link/record returned for WP_078606386.1

Expected Output:

Fasta file containing the nucleotide sequences returned from the amino acid sequences given.

Operating System: Linux (Linux for Windows, flavor Xubuntu)

widdowquinn · 2022-07-20T15:02:16Z

Thanks @ArtingBioinfo - could you please provide an input sequence file that could help me reproduce this error, so I can try to fix it?

ArtingBioinfo · 2022-07-20T15:16:37Z

Hello! I've attached the input file that produced the error above. I hope it helps! Thank you for your time and all the best, Lee Keenan

widdowquinn · 2022-07-20T15:19:43Z

Hi @ArtingBioinfo - I think you'll have to attach the file through the web interface at #34 - the email attachment doesn't seem to come through.

L.

widdowquinn · 2022-07-20T15:28:27Z

Please don't provide sequences as a pdf. Please upload (drag/drop into the box) the FASTA file itself.

NOTE: as GitHub doesn't recognise FASTA as a format, you can either compress the file as a .zip file, or change the file extension to .txt.

ArtingBioinfo · 2022-07-20T15:56:58Z

Galaxy26-[SigR_500_blastdbcmd] fasta.txt

That makes far more sense as to why I couldn't originally upload the file. That should be an acceptable txt file now.

Apologies for the inconvenience caused.

widdowquinn · 2022-07-20T18:12:33Z

The issue seems to arise from input sequences that have been suppressed or removed in NCBI, such as

For sequences such as these, which are not annotated on any genome, we will not be able to recover a coding sequence.
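The IndexError in the traceback from the original report is raised while indexing result[0]["LinkSetDb"][0]. A minimal sketch of a more defensive extraction follows, assuming that the parsed link result for a suppressed or removed record carries an empty "LinkSetDb" list. The function name linked_ids and the sample data are hypothetical and are not part of ncfp.

```python
from typing import Any, Dict, List


def linked_ids(result: List[Dict[str, Any]]) -> List[str]:
    """Return linked IDs from a parsed link result, or [] if there are none.

    Hypothetical helper: it mirrors the structure indexed in the traceback
    (result[0]["LinkSetDb"][0]["Link"]) but checks each level before
    indexing, so a record with no links yields an empty list instead of
    raising IndexError.
    """
    if not result:
        return []
    linksets = result[0].get("LinkSetDb", [])
    if not linksets:
        # Assumed shape for a suppressed/removed record: no link sets at all.
        return []
    return [link["Id"] for link in linksets[0].get("Link", [])]


# Made-up parsed results, for illustration only.
with_links = [{"LinkSetDb": [{"Link": [{"Id": "123"}, {"Id": "456"}]}]}]
without_links = [{"LinkSetDb": []}]

print(linked_ids(with_links))
print(linked_ids(without_links))
```

With a helper of this kind, the caller could count an empty list as a failed lookup and move on to the next record rather than stopping the whole run.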

In addition, these WP_ records are identical protein groups (IPGs). This is the way that RefSeq keeps database size small - it bundles all identical/non-redundant proteins together in a single record. Although the protein sequences may be identical, the underlying coding sequences for each individual protein might not be, and ncfp does not currently do the job of tracking all of those underlying sequences and assigning new IDs so that they can be backtranslated. It is better to remove all the MULTISPECIES/RefSeq IPGs from your input.
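One way to remove those records before running ncfp is sketched below. It assumes that an IPG can be recognised from its FASTA header, either by a WP_ accession prefix or by the word MULTISPECIES in the description. The helper names and the sample records are hypothetical, and the accessions are made up.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_ipg(header: str) -> bool:
    """Assumption: IPG records have a WP_ accession or a MULTISPECIES tag."""
    return header.startswith("WP_") or "MULTISPECIES" in header


def drop_ipgs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return only the records that do not look like RefSeq IPGs."""
    return [(h, s) for h, s in read_fasta(lines) if not is_ipg(h)]


# Made-up records, for illustration only.
sample = [
    ">WP_000000001.1 MULTISPECIES: hypothetical protein [Streptomyces]",
    "MSTAQ",
    ">ABC12345.1 hypothetical protein [Streptomyces sp.]",
    "MKTAY",
    "IAK",
]

kept = drop_ipgs(sample)
for header, seq in kept:
    print(f">{header}\n{seq}")
```

To filter a real file, pass an open file handle to drop_ipgs and write the kept records to a new FASTA file for use as ncfp input.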

widdowquinn added the bug label ("A problem or other undesirable behaviour in the code") Jul 20, 2022

widdowquinn self-assigned this Jul 20, 2022

widdowquinn mentioned this issue Jul 21, 2022 in #37 (merged): "Issue 34: add option to allow alternative start sites"
