[translate] Rewrite GFF locus_tag test
Switches the `translate-with-gff-and-locus-tag.t` test to use the
same data as the corresponding `translate-with-gff-and-gene.t`, thus
testing _just_ the change in GFF syntax.
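
To illustrate the GFF syntax difference that the two tests isolate, here is a minimal Python sketch that reads the name of a feature from column 9 of a GFF3 line, accepting either a `gene` or a `locus_tag` qualifier. The helper names `parse_attributes` and `feature_name`, the lookup order, and the example lines are assumptions made for illustration; this is not augur's implementation.

```python
def parse_attributes(column9):
    """Parse a GFF3 attributes column (';'-separated key=value pairs) into a dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        # Values in the test GFF are quoted, e.g. locus_tag="CA"; drop the quotes.
        attributes[key.strip()] = value.strip().strip('"')
    return attributes


def feature_name(gff_line, qualifiers=("gene", "locus_tag")):
    """Return the first matching qualifier value of a tab-separated GFF3 line, or None."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    attributes = parse_attributes(columns[8])
    for qualifier in qualifiers:  # hypothetical lookup order
        if qualifier in attributes:
            return attributes[qualifier]
    return None


# Hypothetical lines that differ only in the qualifier used to store the name.
with_gene = 'PF13/251013_18\tGenBank\tgene\t91\t456\t.\t+\t.\tgene="CA"'
with_locus_tag = 'PF13/251013_18\tGenBank\tgene\t91\t456\t.\t+\t.\tlocus_tag="CA"'
print(feature_name(with_gene), feature_name(with_locus_tag))
```

Both calls resolve to the same name, which is why a test pair built on identical data can attribute any difference in output to the qualifier alone.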

The replaced test used TB data, which was problematic for a few reasons:
- The VCF file wasn't correctly formatted: it mixed haploid and diploid
  genotypes. TreeTime's `read_vcf` will error on this after
  <neherlab/treetime#263> is merged.
- The VCF encoded some genotypes as '.', which were read as allele="N",
  but these were supposed to be reference bases (encoded as genotype="0").
  If we update the VCF then the resulting aa_muts.json is very different.
  This speaks to a bigger problem with test data such as this: there is
  no assurance that the output data is correct. For this reason I prefer
  the "simple-genome" tests, for which we can validate every mutation.
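
The two VCF problems described above can be detected with a small check. The following is a minimal Python sketch, assuming a tab-separated VCF whose FORMAT column contains `GT`; the function name `summarise_genotypes` and the example records are hypothetical, and this is not the logic used by TreeTime's `read_vcf`.

```python
import re


def summarise_genotypes(vcf_lines):
    """Return (set of ploidies seen, number of missing '.' alleles) across all samples."""
    ploidies = set()
    missing_alleles = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 10:
            continue  # no FORMAT/sample columns on this record
        format_keys = columns[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample in columns[9:]:
            fields = sample.split(":")
            genotype = fields[gt_index] if gt_index < len(fields) else "."
            alleles = re.split(r"[/|]", genotype)  # '/' unphased, '|' phased
            ploidies.add(len(alleles))
            missing_alleles += alleles.count(".")
    return ploidies, missing_alleles


# Hypothetical records: haploid calls, one diploid call and one missing call.
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_A\tsample_B\tsample_C",
    "ref\t10\t.\tA\tG\t.\t.\t.\tGT\t1\t0\t.",
    "ref\t20\t.\tC\tT\t.\t.\t.\tGT\t0/1\t0\t1",
]
ploidies, missing_alleles = summarise_genotypes(records)
print(f"ploidies seen: {sorted(ploidies)}, missing alleles: {missing_alleles}")
```

More than one ploidy in the result indicates the haploid/diploid mixture, and a non-zero missing count flags '.' genotypes that a reader may interpret as "N" rather than as the reference base.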
jameshadfield committed Dec 21, 2023
1 parent 7e1eb1d commit 2c9db82
Showing 8 changed files with 18 additions and 84,917 deletions.
33 changes: 18 additions & 15 deletions tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
@@ -4,23 +4,26 @@ Setup
$ export DATA="$TESTDIR/../data"
$ export SCRIPTS="$TESTDIR/../../../../scripts"

Translate amino acids for genes using a GFF3 file where the gene names are stored in a qualifier named "locus_tag".
This is an identical test setup as `translate-with-gff-and-gene.t` but using locus_tag instead of gene in the GFF

$ cat >genemap.gff <<~~
> ##gff-version 3
> ##sequence-region PF13/251013_18 1 10769
> PF13/251013_18 GenBank gene 91 456 . + . locus_tag="CA"
> PF13/251013_18 GenBank gene 457 735 . + . locus_tag="PRO"
> ~~

$ ${AUGUR} translate \
> --tree "${DATA}/tb/tree.nwk" \
> --genes "${DATA}/tb/genes.txt" \
> --vcf-reference "${DATA}/tb/ref.fasta" \
> --ancestral-sequences "${DATA}/tb/nt_muts.vcf" \
> --reference-sequence "${DATA}/tb/Mtb_H37Rv_NCBI_Annot.gff" \
> --output-node-data aa_muts.json \
> --alignment-output translations.vcf \
> --vcf-reference-output translations_reference.fasta
Gene length of 'rrs' is not a multiple of 3. will pad with N
Read in 187 specified genes to translate.
Read in 188 features from reference sequence file
162 genes had no mutations and so have been be excluded.
> --tree "${DATA}/zika/tree.nwk" \
> --ancestral-sequences "${DATA}/zika/nt_muts.json" \
> --reference-sequence genemap.gff \
> --output-node-data aa_muts.json
Read in 3 features from reference sequence file
Validating schema of '.+/nt_muts.json'... (re)
amino acid mutations written to .* (re)

$ python3 "${SCRIPTS}/diff_jsons.py" "${DATA}/tb/aa_muts.json" aa_muts.json \
> --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
$ python3 "${SCRIPTS}/diff_jsons.py" \
> --exclude-regex-paths "['seqid']" -- \
> "${DATA}/zika/aa_muts_gff.json" \
> aa_muts.json
{}
8,415 changes: 0 additions & 8,415 deletions tests/functional/translate/data/tb/Mtb_H37Rv_NCBI_Annot.gff

This file was deleted.
