Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formatting of tRNA FASTA reference file #70

Open
Phontia opened this issue Aug 9, 2024 · 9 comments
Open

Formatting of tRNA FASTA reference file #70

Phontia opened this issue Aug 9, 2024 · 9 comments

Comments

@Phontia
Copy link

Phontia commented Aug 9, 2024

Dear all,

As I saw in previous discussions here (and also as stated in the STAR protocol), when using self-generated tRNA fasta references files created with tRNAScan as input for mimseq, this often results in errors when running, such as the one I'm getting here:

Command line:
mimseq -t ACV_Nuclear_tRNAs.fa -o ACV_Nuclear_tRNAs.out -p ACV_Plastid_tRNAs.fa -m PR_Mito_tRNAs.fa --cluster-id 0.90 --threads 15 --min-cov 0.0005 --max-mismatches 0.075 --control-condition Total -n Acv_1st_Mimseqrun --out-dir mimseq_Results_Acv --max-multi 4 --remap --remap-mismatches 0.05 ACVSampleData.txt

Output:
File "/cluster/home/andreaf/miniconda3/envs/mimseq/lib/python3.7/site-packages/mimseq/tRNAtools.py", line 346, in intronRemover ID = re.search("tRNAscan-SE ID: (.*?)\).|\((chr.*?)-",seqIO_dict[seqIO_record].description).groups() AttributeError: 'NoneType' object has no attribute 'groups'

The headers and naming convention of tRNA sequences in tRNAscan otput files differ considerably from the ones you see downloaded from GtRNAdb. For example in the headers of my tRNAscan output file there is no chromosome information, etc.

Unfortunately, some plant species I'm working with have no entry on online tRNA databases (neither GtRNAdb or PlantRNA), and thus I don't have much choice (I'm trying to use closely related species in some cases, but for some of my plants none are available).

How exactly should I format the outputs from tRNAscan so that they can be read by mimseq?

Thank you for your time.

Best,
Andrea

@nedialkova-lab
Copy link
Owner

Hi Andrea,

As an example, see the S. cerevisiae sacCer3 pre-built reference here:
https://github.com/nedialkova-lab/mim-tRNAseq/tree/master/mimseq/data/sacCer3-eColitK.

Pay close attention to the order of information, the number of fields separated by spaces, the naming convention for tRNA genes, and the tRNAScan-SE ID given in parentheses that matches the corresponding entry in the out file (specified with -o).

If mitochondrial and/or plastid sequences are specified with -m, these also require specific formatting that is distinct from the nuclear genomic file as they match the format provided by the mitotRNAdb. If formatting is incorrect, mimseq will produce an error similar to the error you get. See an example here for correct formatting: https://github.com/nedialkova-lab/mim-tRNAseq/blob/master/mimseq/data/sacCer3-eColitK/sacCer3-mitotRNAs.fa; again, pay close attention to the number of fields per sequence header (i.e., 5) and the use of “|” as a field separator. In this case, the first field specifying the ID can be any user-chosen value. The third field is a unique species code, which is unused by mimseq but must be present.

Note that the use of custom input files for species not currently included in mimseq has not been extensively tested.

@Phontia
Copy link
Author

Phontia commented Aug 9, 2024

Thank you for the answer!

I'll try to make a script to rename all of the headers! If it works I will update you and give the script here, so it may be useful also for others :)

Have a nice weekend!

Best,
Andrea

@Phontia
Copy link
Author

Phontia commented Aug 15, 2024

Dear all,

I made some progress in regards to formatting correctly the input files. The software performs the first allignment, but when it comes to the misincorporations and RT stops analysis I encounter the following error message:

Command line:
mimseq -t xReRenamed_ACV_Plastid_NuclearFormat_tRNA.fa -o ACV_Plastid_tRNAs_NuclearFormat.out --threads 15 --min-cov 0.0005 --max-mismatches 0.1 --control-condition Total -n ACV_1st_Mimseqrun_Plastid --out-dir ACV_1st_mimseqRun_Plastid --max-multi 4 --remap --remap-mismatches 0.075 ACVSampleData.txt

Output:
.
.
.
+--------------------------------------------------------------------+
| Analysing misincorporations and stops to RT, and analysing 3' ends |
+--------------------------------------------------------------------+
2024-08-15 18:11:08,636 [INFO ] ** Discovering unannotated modifications for realignment **
2024-08-15 18:11:09,195 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_1-Library_A_S4_R1_001_ACV_ChlA.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,197 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_2-Library_B_S3_R1_001_ACV_ChlB.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,200 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_4-Library_D_S1_R1_001_ACV_ChlD.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,206 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_3-Library_C_S2_R1_001_ACV_TotC.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,209 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_2-Library_B_S3_R1_001_Acv_TotB.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,300 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_3-Library_C_S2_R1_001_Acv_ChlC.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,394 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_4-Library_D_S1_R1_001_Acv_TotD.trimFinal.unpaired_uniq.bam...
2024-08-15 18:11:09,403 [INFO ] Analysing ACV_1st_mimseqRun_Plastid/357391_1-Library_A_S4_R1_001_ACV_TotA.trimFinal.unpaired_uniq.bam...
2024-08-15 18:13:44,202 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_2-Library_B_S3_R1_001_ACV_ChlB.trimFinal.unpaired_uniq.bam
2024-08-15 18:17:56,094 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_2-Library_B_S3_R1_001_Acv_TotB.trimFinal.unpaired_uniq.bam
2024-08-15 18:22:18,996 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_1-Library_A_S4_R1_001_ACV_ChlA.trimFinal.unpaired_uniq.bam
2024-08-15 18:24:45,001 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_3-Library_C_S2_R1_001_ACV_TotC.trimFinal.unpaired_uniq.bam
2024-08-15 18:26:56,899 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_3-Library_C_S2_R1_001_Acv_ChlC.trimFinal.unpaired_uniq.bam
2024-08-15 18:27:34,302 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_4-Library_D_S1_R1_001_Acv_TotD.trimFinal.unpaired_uniq.bam
2024-08-15 18:29:00,018 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_4-Library_D_S1_R1_001_ACV_ChlD.trimFinal.unpaired_uniq.bam
2024-08-15 18:29:13,515 [INFO ] Finding potential unannotated mods for ACV_1st_mimseqRun_Plastid/357391_1-Library_A_S4_R1_001_ACV_TotA.trimFinal.unpaired_uniq.bam
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/site-packages/mimseq/mmQuant.py", line 349, in bamMods_mp
new_mods, new_Inosines = unknownMods(inputs, knownTable, cluster_dict, modTable_prop, misinc_thresh, cov_table_newMods, min_cov, tRNA_dict, remap)
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/site-packages/mimseq/mmQuant.py", line 90, in unknownMods
if (tRNA_dict[isodecoder]['sequence'][pos-1] == 'A' and list(modTable[isodecoder][pos].keys())[list(modTable[isodecoder][pos].values()).index(max(modTable[isodecoder][pos].values()))] == 'G' and pos-1 == min(anticodon)):
ValueError: min() arg is an empty sequence
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/bin/mimseq", line 10, in
sys.exit(main())
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/site-packages/mimseq/mimseq.py", line 421, in main
args.misinc_thresh, args.mito, args.plastid, args.pretrnas, args.local_mod, args.p_adj, args.crosstalks, args.sampledata)
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/site-packages/mimseq/mimseq.py", line 130, in mimseq
new_mods, new_Inosines, filtered_cov, filter_warning, unsplitCluster_lookup,readRef_unsplit_newNames = generateModsTable(coverageData, out, name, threads, min_cov, mismatch_dict, insert_dict, del_dict, cluster_dict, cca, remap, misinc_thresh, mod_lists, Inosine_lists, tRNA_dict, Inosine_clusters, unique_isodecoderMMs_new, splitBool_new, isodecoder_sizes, unsplitCluster_lookup, cluster, crosstalks)
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/site-packages/mimseq/mmQuant.py", line 695, in generateModsTable
new_mods, new_Inosines, readRef_match_dict = zip(*pool.map(func, bamlist))
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/cluster/home/andreaf/miniconda3/envs/mimseq.2/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
ValueError: min() arg is an empty sequence
`

If you were able to indicate a possible issue it would be extremely appreciated. I'm a bit stuck right now...

Best,
Andrea Fontana

@nedialkova-lab
Copy link
Owner

Hi Andrea,
There still seem to be some issues with the reference file. Would you mind sharing it (by email) so I can take a look? If I could have a subset of one of the fastq files for running a few tests it would be ideal. Thanks!

Danny

@drewjbeh
Copy link
Collaborator

Hi Andrea,

The error is coming from an error in determining the anticodon position in a one or more tRNAs/clusters. mimseq should generate a file whose name ends in *isodecoderTranscripts_align.stk (where * is replaced by the name you give your run). If the run doesn't complete it should just be somewhere in the output folder. Could you please send that? This should also give us some more info about how your custom reference is functioning. Thanks!

@Phontia
Copy link
Author

Phontia commented Aug 26, 2024

Dear all,

Sorry for not updating you but last week I managed to solve this issue :)

It had to do with the reference fasta file and the first name of the header. To put it simply I was writing the header like this:

Arabidopsis_thaliana_tRNA-Arg-CCT-1-4 (tRNAscan-SE ID: chr5.trna94) best isotype model: Arg (Sc: 65.1) Arg (CCT) 73 bp Sc: 55.8 chr5:13805982-13806054 (-)

I then exchanged the numbers in the first name of the header so it became like this:

Arabidopsis_thaliana_tRNA-Arg-CCT-4-1 (tRNAscan-SE ID: chr5.trna94) best isotype model: Arg (Sc: 65.1) Arg (CCT) 73 bp Sc: 55.8 chr5:13805982-13806054 (-)

And then I didn't get an error anymore! So it seems to be running fine now :)

P.S. Just as another question regarding organellar tRNAs. Are organellar tRNAs that seem to be encoded in the nuclear genome automatically removed by mimseq? (there is a brief mention of this in your original publication). I ask this because I noticed while building my nuclear tRNA reference fasta file with tRNAscan that there are many tRNA genes in the nuclear genome that are very similar (in some cases even identical!) to chloroplast/mitochondrial encoded tRNAs. I wonder if this is a case of not properly "cleaning up" the reference nuclear genome from organellar-derived reads, or if actually there are organellar tRNAs encoded in the nucleus.

In general, do you maybe know whether these tRNAs are transcribed and give rise to functional tRNAs? Sadly, I don't find much about them in the literature...

Thanks again for the help!

Best,
Andrea

@drewjbeh
Copy link
Collaborator

Hi Andrea,

Thaks for the update. Do you know why changing the name worked? What led you to change the transcript and gene number of this first gene? Was the name duplicated somewhere else? This info can help us help others and troubleshoot in the future.

For the filtering step, we simply filter by name. In an older version of gtRNAdb, nuclear-encoded mitochondrial tRNA genes were named with "nmt". This has since changed, but the filtering step in mimseq is still there. We also filter the undefined tRNAs that contain "Und".

@Phontia
Copy link
Author

Phontia commented Aug 26, 2024

Dear all,

I'll take you through how I adapted all of the outputs from tRNAscan such that they can be read by mimseq. I hope it can explain how I solved my issue, and that it can be helpful also to others.

For both the plastid and mitochondrial fasta reference files I renamed all of the sequence headers of the tRNAScan output .fa file manually. I followed very closely the formatting you have in your example input files for mitochondrial tRNAs (5 fields divided by "|"). This is not so heavy, since the number of tRNAs encoded in organelles is relatively little and it doesn't take so much time to rename all of the headers manually.

For the reference .fa and .out files for nuclear tRNAs however it is more complicated, since the number of predicted tRNA genes are many. To start, I first remove all duplicate identical sequences and keep only one.

The headers of tRNA sequences in the .fa file directly outputted from tRNAscan look something like this for me (I show just one sequence as an example):

>GM067323.1.trna2 GM067323.1:2494823-2494894 (+) Asp (GTC) 72 bp Sc: 48.6
GACGTTGTAGTATAGTGGTGAGTATTCCCGCCTGTCACCCGGGTGACCCGGGTTCCATCC
CCGGCAACGGCG

I then use this code, to generate the tRNA scan ID field, which is needed by Mimseq to find the appropriate data in the .out file.

awk '/^>/ {id=substr($1, 2); id_with_underscore=id; sub(/\./, "_", id_with_underscore); desc=$0; sub(id, id_with_underscore, desc); sub(/.*/, ">"id_with_underscore" (tRNAscan-SE ID: "id_with_underscore") "substr(desc,length(id_with_underscore)+3)); print} !/^>/ {print}' Input.fa > Output.fa

And you get sequence headers that look like this:

>GM067323_1.trna2 (tRNAscan-SE ID: GM067323_1.trna2) GM067323.1:2494823-2494894 (+) Asp (GTC) 72 bp Sc: 48.6
GACGTTGTAGTATAGTGGTGAGTATTCCCGCCTGTCACCCGGGTGACCCGGGTTCCATCC
CCGGCAACGGCG

Additionally, the code changes the "." in the sequence name (GM067323.1) to an underscore (GM067323_1), because mimseq will look in the tRNAscan-SE ID field at whatever is after the "." to find the tRNA number in the .out file. If you have a "." in the sequence name this will cause an error.

I also rename all of the sequence names in the .out file (so that they also have the underscore) with the following command

awk 'BEGIN {OFS="\t"} NR==1 {print; next} {sub(/\./, "_", $1); print}' input_file.out > output_file.out

Then, I also remove all tRNAs predicted as pseudogenes (I actually don't know if this is necessary), from both the .fa file and .out file with the following commands

awk '/^>/ {pseudogene = /Possible pseudogene/} !pseudogene' input.fasta > output.fasta

awk '!/pseudo/' input.out > output.out

Finally, I generate the first name of the headers in the .fa file using a Python script (YourSpeciesName should be exchanged with the name of your species). I save this Python script as FormatFastaHeaders.py

import re
import sys
def process_fasta(input_file, output_file):
    occurrences = {}
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        header_line = None

        for line in infile:
            line = line.strip()
            
            if line.startswith('>'):
                if header_line:
                    outfile.write(header_line + '\n')
                    header_line = None

                # Extract the amino acid and codon
                match = re.search(r'\s(\w+)\s\((\w+)\)', line)
                if match:
                    amino_acid = match.group(1)
                    codon = match.group(2)
                    
                    key = f"{amino_acid}-{codon}"
                    
                    count = occurrences.get(key, 0) + 1
                    occurrences[key] = count
                    
                    # Construct the new header prefix
                    new_prefix = f"YourSpeciesName_tRNA-{amino_acid}-{codon}-1-{count}"
                    
                    # Replace the old prefix with the new one
                    header_line = re.sub(r'^>\S+', f'>{new_prefix}', line, 1)
            else:
                if header_line:
                    outfile.write(header_line + '\n')
                    header_line = None
                outfile.write(line + '\n')

        if header_line:
            outfile.write(header_line + '\n')

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python format_fasta_headers.py input_fasta_file output_fasta_file")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]
    process_fasta(input_file, output_file)

I then run this command

python3 FormatFastaHeaders.py input.fasta output.fasta

And you should get headers like this, where the last number in the first name of the header indicates the "isodecoder number" (i.e. the number depends on how many tRNAs were found previously with the same amino acid/anticodon)

>YourSpeciesName_tRNA-Asp-GTC-1-1 (tRNAscan-SE ID: GM067323_1.trna2) GM067323.1:2494823-2494894 (+) Asp (GTC) 72 bp Sc: 48.6
GACGTTGTAGTATAGTGGTGAGTATTCCCGCCTGTCACCCGGGTGACCCGGGTTCCATCC
CCGGCAACGGCG
>YourSpeciesName_tRNA-Asp-GTC-1-2 (tRNAscan-SE ID: GM067330_1.trna36) GM067330.1:105009878-105009965 (+) Asp (GTC) 88 bp Sc: 34.5
GGAGAGATAGCTGAGTGAACTAAAGCGGCGTATTGTCAATCTGTTGTACGAGTTAATTGT
ACCGAGGGTTCGAATCTCTCTCTTTTCG
>YourSpeciesName_tRNA-Asp-GTC-1-3 (tRNAscan-SE ID: CM067331_1.trna54) CM067331.1:138290376-138290441 (+) Asp (GTC) 66 bp Sc: 24.8
GAAATAGCTCAGTAGGTTAGAGTGCTGGTTTGTCATGCCAGAAGCCGTGGGTTCGAACTC
TATACA

This is because I had the impression from one of your example reference files (i.e. sacCer3-tRNAs.fa) that the last number in the first name of the sequence header (highlighted with **) indicates the number of isodecoders.

As an example from your sacCer3-tRNAs.fa file:

>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**1** (tRNAscan-SE ID: chrIV.trna3) chrIV:410379-410451 (+) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**10** (tRNAscan-SE ID: chrXIII.trna9) chrXIII:768369-768441 (-) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**11** (tRNAscan-SE ID: chrXVI.trna7) chrXVI:856902-856974 (+) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**2** (tRNAscan-SE ID: chrVI.trna6) chrVI:204924-204996 (-) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**3** (tRNAscan-SE ID: chrVII.trna15) chrVII:774349-774421 (+) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**4** (tRNAscan-SE ID: chrVIII.trna8) chrVIII:146242-146314 (-) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**5** (tRNAscan-SE ID: chrX.trna22) chrX:197313-197385 (-) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**6** (tRNAscan-SE ID: chrXI.trna4) chrXI:219895-219967 (+) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**7** (tRNAscan-SE ID: chrXI.trna9) chrXI:517988-518060 (+) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**8** (tRNAscan-SE ID: chrXII.trna16) chrXII:656934-657006 (-) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-AGC-1-**9** (tRNAscan-SE ID: chrXIII.trna17) chrXIII:321147-321219 (-) Ala (AGC) 73 bp Sc: 69.6
GGGCGTGTGGCGTAGTCGGTAGCGCGCTCCCTTAGCATGGGAGAGGtCTCCGGTTCGATT
CCGGACTCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-TGC-1-**1** (tRNAscan-SE ID: chrI.trna2) chrI:166267-166339 (+) Ala (TGC) 73 bp Sc: 76.0
GGGCACATGGCGCAGTTGGTAGCGCGCTTCCCTTGCAAGGAAGAGGtCATCGGTTCGATT
CCGGTTGCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-TGC-1-**2** (tRNAscan-SE ID: chrV.trna16) chrV:312023-312095 (-) Ala (TGC) 73 bp Sc: 76.0
GGGCACATGGCGCAGTTGGTAGCGCGCTTCCCTTGCAAGGAAGAGGtCATCGGTTCGATT
CCGGTTGCGTCCA
>Saccharomyces_cerevisiae_tRNA-Ala-TGC-1-**3** (tRNAscan-SE ID: chrVII.trna17) chrVII:794417-794489 (+) Ala (TGC) 73 bp Sc: 76.0
GGGCACATGGCGCAGTTGGTAGCGCGCTTCCCTTGCAAGGAAGAGGtCATCGGTTCGATT
CCGGTTGCGTCCA
.
.
.

However, when I ran mimseq with my file just like that, I was getting the aforementioned error message.

After testing with different subsets of my fasta reference file, I realized this problem was likely due to mimseq not properly distinguishing between isodecoders, since if I only had tRNAs with different aminoacid/anticodons combinations in my reference no error would appear.

Quite funnily, it was just necessary to exchange these last two numbers in the first name of the header to get mimseq running without problems :)

Thus my headers now look like this, Where the "isodecoder number" now comes first

>YourSpeciesName_tRNA-Asp-GTC-1-1 (tRNAscan-SE ID: GM067323_1.trna2) GM067323.1:2494823-2494894 (+) Asp (GTC) 72 bp Sc: 48.6
GACGTTGTAGTATAGTGGTGAGTATTCCCGCCTGTCACCCGGGTGACCCGGGTTCCATCC
CCGGCAACGGCG
>YourSpeciesName_tRNA-Asp-GTC-2-1 (tRNAscan-SE ID: GM067330_1.trna36) GM067330.1:105009878-105009965 (+) Asp (GTC) 88 bp Sc: 34.5
GGAGAGATAGCTGAGTGAACTAAAGCGGCGTATTGTCAATCTGTTGTACGAGTTAATTGT
ACCGAGGGTTCGAATCTCTCTCTTTTCG
>YourSpeciesName_tRNA-Asp-GTC-3-1 (tRNAscan-SE ID: CM067331_1.trna54) CM067331.1:138290376-138290441 (+) Asp (GTC) 66 bp Sc: 24.8
GAAATAGCTCAGTAGGTTAGAGTGCTGGTTTGTCATGCCAGAAGCCGTGGGTTCGAACTC
TATACA

I'm not sure why before it was giving an error, but I'm happy now it seems to be working :)

Please feel free to ask any more questions if anything is not clear.

Best,
Andrea

@drewjbeh
Copy link
Collaborator

Hi Andrea,

Wow, thank you so much for the detailed description. This will certainly be useful for many other users trying to build their own references.

One note about the naming. The first number in the name should indicate isodecoder, while the second number should indicate the gene copy number (I guess arbitrarily assigned, as far as I can tell) within that isodecoder. This means that all Ala-TGC-1 sequences from the example above are identical. This is important for clustering and for nomenclature in general, so be sure to adhere to these rules. Simply renaming the sequence to have a different isodecoder number will effect many parts of alignment, clustering, modification analysis, and generally any downstream analysis as you might be referring to two identical sequences as separate isodecoders.

I can offer some help though. I am familiar with errors in mimseq where, for example, the naming is not done correctly. That is, if a tRNA name indicated an isodecoder but its sequence is not identical to other genes in that isodecoder family, or vice versa where a gene has a unique isodecoder number but actually shares a mature sequence with another set of genes in a differently numbered isodecoder. In both of these cases I have manually curated and changed tRNA names to be correct - see this as an example for human hg38. Especially under the #### NB #### section.

If you could also script a way to check these naming restrictions and conventions and name your tRNAs accordingly, this would be a great pipeline for new input species!

Thanks,
Drew

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants