[translate] Rewrite GFF locus_tag test

Switches the `translate-with-gff-and-locus-tag.t` test to using the same data as the corresponding `translate-with-gff-and-gene.t`, thus testing _just_ the change in GFF syntax.

The replaced test used TB data which was problematic for a few reasons:

- The VCF file wasn't correctly formatted, with a mixture of haploid and diploid genotypes. TreeTime's `read_vcf` will error on this after <neherlab/treetime#263> is merged.
- The VCF encoded genotypes of '.' which were read as allele="N", however these were supposed to be reference bases (encoded as genotype="0"). If we update the VCF then the aa_muts.json are very different.

This speaks to a bigger problem with test data such as this - there is no assurance that the output data is correct. For this reason I prefer the "simple-genome" tests for which we can validate every mutation.
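
To illustrate the first VCF problem described above, here is a minimal sketch of how a mixture of haploid and diploid genotype calls could be detected. It is not taken from TreeTime or augur: the function names `genotype_ploidies` and `has_mixed_ploidy` are hypothetical, and the sketch assumes a plain-text, tab-separated VCF whose FORMAT column contains `GT`.

```python
from collections import Counter


def genotype_ploidies(vcf_lines):
    """Count how many genotype calls of each ploidy appear in VCF lines.

    Hypothetical helper for illustration only; assumes tab-separated
    records with FORMAT in column 9 and samples from column 10 onwards.
    """
    ploidies = Counter()
    for line in vcf_lines:
        # Skip meta-information lines, the header line, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # record has no sample columns
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_index = format_keys.index("GT")
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            # Alleles are separated by "/" (unphased) or "|" (phased).
            alleles = genotype.replace("|", "/").split("/")
            ploidies[len(alleles)] += 1
    return ploidies


def has_mixed_ploidy(vcf_lines):
    """Return True if more than one ploidy is present among the calls."""
    return len(genotype_ploidies(vcf_lines)) > 1
```

A call such as `has_mixed_ploidy(open("nt_muts.vcf"))` would flag a file like the one described above. A missing call of `.` counts as haploid here, so this sketch says nothing about whether such calls were meant to be reference bases.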
nextstrain · Dec 21, 2023 · 2c9db82
1 parent 7e1eb1d
commit 2c9db82

Showing 8 changed files with 18 additions and 84,917 deletions.
diff --git a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
@@ -4,23 +4,26 @@ Setup
   $ export DATA="$TESTDIR/../data"
   $ export SCRIPTS="$TESTDIR/../../../../scripts"
 
-Translate amino acids for genes using a GFF3 file where the gene names are stored in a qualifier named "locus_tag".
+This is an identical test setup as `translate-with-gff-and-gene.t` but using locus_tag instead of gene in the GFF
+
+  $ cat >genemap.gff <<~~
+  > ##gff-version 3
+  > ##sequence-region PF13/251013_18 1 10769
+  > PF13/251013_18	GenBank	gene	91	456	.	+	.	locus_tag="CA"
+  > PF13/251013_18	GenBank	gene	457	735	.	+	.	locus_tag="PRO"
+  > ~~
 
   $ ${AUGUR} translate \
-  >   --tree "${DATA}/tb/tree.nwk" \
-  >   --genes "${DATA}/tb/genes.txt" \
-  >   --vcf-reference "${DATA}/tb/ref.fasta" \
-  >   --ancestral-sequences "${DATA}/tb/nt_muts.vcf" \
-  >   --reference-sequence "${DATA}/tb/Mtb_H37Rv_NCBI_Annot.gff" \
-  >   --output-node-data aa_muts.json \
-  >   --alignment-output translations.vcf \
-  >   --vcf-reference-output translations_reference.fasta
-  Gene length of 'rrs' is not a multiple of 3. will pad with N
-  Read in 187 specified genes to translate.
-  Read in 188 features from reference sequence file
-  162 genes had no mutations and so have been be excluded.
+  >   --tree "${DATA}/zika/tree.nwk" \
+  >   --ancestral-sequences "${DATA}/zika/nt_muts.json" \
+  >   --reference-sequence genemap.gff \
+  >   --output-node-data aa_muts.json
+  Read in 3 features from reference sequence file
+  Validating schema of '.+/nt_muts.json'... (re)
   amino acid mutations written to .* (re)
 
-  $ python3 "${SCRIPTS}/diff_jsons.py" "${DATA}/tb/aa_muts.json" aa_muts.json \
-  >  --exclude-regex-paths "root\['annotations'\]\['.+'\]\['seqid'\]"
+  $ python3 "${SCRIPTS}/diff_jsons.py" \
+  >  --exclude-regex-paths "['seqid']" -- \
+  >  "${DATA}/zika/aa_muts_gff.json" \
+  >  aa_muts.json
   {}
diff --git a/tests/functional/translate/data/tb/Mtb_H37Rv_NCBI_Annot.gff b/tests/functional/translate/data/tb/Mtb_H37Rv_NCBI_Annot.gff