Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VCF parsing bugs #1355

Merged
merged 17 commits into from
Jan 22, 2024
Merged

Fix VCF parsing bugs #1355

merged 17 commits into from
Jan 22, 2024

Conversation

jameshadfield
Copy link
Member

The test added here fails on the version of TreeTime augur uses. The tests pass under TreeTime neherlab/treetime#263.

@jameshadfield jameshadfield changed the base branch from james/ancestral-translate-fixes to james/translate/nuc-annotation December 13, 2023 03:41
@jameshadfield jameshadfield force-pushed the james/translate/nuc-annotation branch 3 times, most recently from d1f3247 to c91ca33 Compare December 20, 2023 01:03
Base automatically changed from james/translate/nuc-annotation to master December 20, 2023 01:26
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having next to zero experience with VCF, I can't speak to the reasoning behind changes. Aside from that, overall code changes look good, especially the revamped testing! Noted a few small things.

tests/functional/ancestral/cram/general.t Outdated Show resolved Hide resolved
augur/translate.py Outdated Show resolved Hide resolved
augur/translate.py Outdated Show resolved Hide resolved
augur/translate.py Outdated Show resolved Hide resolved
jameshadfield added a commit that referenced this pull request Dec 30, 2023
Exclude the Augur version from the diff, otherwise it'll fail on the
next release. Some of these tests were added via this PR but many
weren't, so it's easier to do it all in a single commit.

<#1355 (comment)>
@jameshadfield
Copy link
Member Author

jameshadfield commented Dec 30, 2023

Thanks for the review @victorlin. When paired with neherlab/treetime#263 all tests are passing and running a full cholera analysis using VCFs or using FASTA now produces identical mutations across the tree, so I'm going to draw the line here.

Things to do before merge:

  • Release a new TreeTime version & update here (not needed - we don't pin tightly)
  • Changelog
  • Rebase to fix conflicts

The added test here fails under TreeTime 0.11.1 but has been fixed as
part of <neherlab/treetime#263>.

Description of the bug (as it relates to the test added in `vcf.t`):

The SNPs at nt 33 are encoded in the VCF as:
1	    33	.   A	C,G	.	    .	    .	    GT	    1	        2	        0
where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B.
The ALT 2 is not being parsed by `read_vcf`, which results in
a changed mutation profile at pos 33:
.       **FASTA input**               **VCF input**
.         |---G33C-- sample_A           |---A33C-- sample_A
.  --A33G-|                      -------|
.         |--------- sample_B           |--------- sample_B
.
Because of this bug, the following test fails.

The `read_vcf` function is used in augur commands ancestral, refine,
sequence-traits, translate and tree.
Switches the `translate-with-gff-and-locus-tag.t` test to using the
same data as the corresponding `translate-with-gff-and-gene.t`, thus
testing _just_ the change in GFF syntax.

The replaced test used TB data which was problematic for a few reasons:
- The VCF file wasn't correctly formatted, with a mixture of haploid and
  diploid genotypes. TreeTime's `read_vcf` will error on this after
  <neherlab/treetime#263> is merged.
- The VCF encoded genotypes of '.' which were read as allele="N", however
  these were supposed to be reference bases (encoded as genotype="0").
  If we update the VCF then the aa_muts.json are very different. This
  speaks to a bigger problem with test data such as this - there is no
  assurance that the output data is correct. For this reason I prefer
  the "simple-genome" tests for which we can validate every mutation.
This requires a TreeTime with
<neherlab/treetime#263>. The benefit is that the
produced VCF file will encode genotypes with the same ploidy as the
input, as well as using the same chromosome name
The functionality is unchanged but this centralises the logic. A
subsequent commit will re-use this "is a file a VCF" logic within `augur
translate` so rather than pulling it into it's own function it's better
to use the widely-used .io function.
If VCF input with requested alignment output then we require
--vcf-reference-output. This was not the case in augur 23.1.1 and
earlier, where we would automatically create a filename. This is in
line with a general philosophy of "files only created when requested"
This test was modified to confirm my own understanding of the code which
allowed the next set of commits (modifying `augur translate`). Rather
than drop it, it's helpful to include it in the repo for future
reference. This commit is related to
<#1362>.
jameshadfield added a commit that referenced this pull request Jan 22, 2024
Exclude the Augur version from the diff, otherwise it'll fail on the
next release. Some of these tests were added via this PR but many
weren't, so it's easier to do it all in a single commit.

<#1355 (comment)>
For JSON inputs we were previously incorrectly exporting the
root-sequence translations as the "reference". Instead, we now translate
the provided reference (nuc) sequence. (There is some subtlety here
because the provided nuc reference sequence may in fact be the
root-sequence rather than an actual reference, but this is a problem
with `augur ancestral`. See
<#1362> for more details.)

This allows us to compare the reference translation to the root-sequence
translation and thus detail any AA mutations at the root node. A
side-effect of this is that we now always export an array of mutations
for each gene/CDS at the root node, although this may often be empty.

This brings the behaviour of JSON inputs in-line with that of VCF
inputs.
These currently fail because for VCF inputs we do not export the
translated sequences at the root-node. (We do for JSON inputs.) The
subsequent commit will update translate to do this.
This brings the behaviour in-line with that of JSON inputs.
The failing tests of the previous commit now pass.
The (somewhat confusingly named, and perhaps little used?)
`--root-sequence <FASTA>` is used to report mutations between the
inferred tree root node and the supplied FASTA. The inferred sequence
would always be upper-case, but the supplied reference wasn't converted
so if it were lowercase we would get erroneous mutations such as 'a1A'.

Closes #1376
See <#1380> for a full
description of the bug this fixes (or see the added tests here). The
changes here need to be paired with a corresponding commit in TreeTime
(commit message '[get_tree_dict] restore argument behaviour').
jameshadfield added a commit that referenced this pull request Jan 22, 2024
Exclude the Augur version from the diff, otherwise it'll fail on the
next release. Some of these tests were added via this PR but many
weren't, so it's easier to do it all in a single commit.

<#1355 (comment)>
Copy link

codecov bot commented Jan 22, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (9cd915a) 66.29% compared to head (937ab0b) 66.69%.

❗ Current head 937ab0b differs from pull request most recent head aa0b7d8. Consider uploading reports for the commit aa0b7d8 to get more accurate results

Files Patch % Lines
augur/ancestral.py 95.83% 2 Missing and 1 partial ⚠️
augur/translate.py 93.33% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1355      +/-   ##
==========================================
+ Coverage   66.29%   66.69%   +0.40%     
==========================================
  Files          69       69              
  Lines        7260     7321      +61     
  Branches     1780     1797      +17     
==========================================
+ Hits         4813     4883      +70     
+ Misses       2179     2170       -9     
  Partials      268      268              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

There are no functional changes. This is in preparation for subsequent
commits which will fix (separate) bugs associated with mask creation,
mutation calling on the root node, and proper masking of sequences.
Fixes a bug where positions which were all ambiguous (i.e. masked)
positions were being reported as mutations in the JSON output.

Furthermore, while the mask was being applied to sequence export to make
all masked positions "N", if we are inferring ambiguous sites then we
should use the reference base instead. This is now fixed.

Closes #1382 for the FASTA-input case only.

Unfortunately the added test doesn't actually recreate the first bug -
for this simple example TreeTime is not inferring a ATGC allele at the
all-Ns position. I can recreate this (and discovered it) using Cholera
data. I presume this is related to numerical consistency across the
(flat) likelihood values in this small test case.

Variable names are changed: `root_sequence` to `reference_sequence`.
This is much clearer (to my eyes at least).

NOTE: A further bug was discovered while working on this, which involved
behaviour when we are not inferring ambiguous bases. See the added test
here, which is currently disabled as otherwise CI fails. This bug is not
newly introduced by this work.
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
Exclude the Augur version from the diff, otherwise it'll fail on the
next release. Some of these tests were added via this PR but many
weren't, so it's easier to do it all in a single commit.

<#1355 (comment)>
This is required for VCF related tests added in <#1355>

Note that different parts of our CI can use different TreeTime versions
according to their installation method. For instance, "pytest-cram"
installs Augur & TreeTime via pip, thus getting 0.11.2 at the time of
writing. "pathogen-repo-ci" first installs augur via conda, and thus
gets 0.11.1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants