Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve VCF parsing #263

Merged
merged 15 commits into from
Jan 22, 2024
Merged

Improve VCF parsing #263

merged 15 commits into from
Jan 22, 2024

Conversation

jameshadfield
Copy link
Contributor

This isn't complete yet, but if people wanted to take a look and see if they are happy with the direction this is going in that'd be great!

I still want to take a close look at how insertions are being encoded in the "sequences" output, and is treetime/augur is ok with that (i.e. data['sequences'][<sample>][<pos>] is more than one character). And ideally I'd sort out the writing of VCFs as well to both preserve the ploidy representation of alleles and also use the correct chromosome name!

The VCF 4.2/4.3 spec is precise about the format of meta-information /
header-line / data-line syntax. We assert that the file contents matches
what we expect and raise errors if not. This allows code simplifications.

I've reordered the if/else block to match how the file appears, not the
previous approach of placing the data lines first as they're the "most
common so first in 'if-list'". This is much more readable to me!
The test failure is due to the failure to parse multiple ALT alleles
and will be fixed in a subsequent commit.
Only addresses the case of a single homozygous nuc ALT allele.
Specifically, the previous code would ignore all ALT alleles beyond a GT
call of 1 (see previously failing test
TestSimpleVcf::test_mutation_parsing)

Comments added regarding other potential bugs, but I'll add tests /
fixes in subsequent commits.
I was hoping the examples would be a little more exhaustive, however
they do cover the basics. The examples in v4.3 are almost identical.

The encoding of insertions, which may or may not be a bug, is detailed
both in the code and in the tests.
Detailed explanation added in test `TestNoCallsOrMissing`

Sections of code were removed which incorrectly considered the "."
character could be part of a multi-character alternate allele, however
the spec is clear that the "." allele can only exist as a
single-character allele. We enforce that alleles conform to this
upstream of the code, so it wasn't possible that `alt[pos+i]=='.'`

Closes #261
The VCF spec uses a genotype of '0' to mean that the reference base(s)
is unchanged for this sample. The previous '.' means an uncalled allele,
in v4.2 "a call cannot be made for a sample at a given locus" and in
v4.3 "a call cannot be made for a sample at a given locus".
This is currently unused but a subsequent commit will allow `write_vcf`
to use this information.

As part of reading this information we now enforce:
- only one chromosome can be defined in the vcf. This will provide
a helpful error message in situations such as <#247>
- genotype calls much consistently use the same ploidy

There is a lot more checking we could do if we wish to parse the
meta-lines
@jameshadfield
Copy link
Contributor Author

@rneher could you suggest someone to review it (or look at it yourself?) 🙏

There are a number of bugfixes related to parsing haploid (or polyploid homozygous) VCFs and I'm now reasonably confident the parsing is correct. The encoding of insertions has the potential to double count the base before the insertion (see comments added in code here) but I don't know what treetime does with these; for the purposes of augur ancestral, which is what motivated this work, insertions are not considered.

I haven't touched the parsing of heterozygous genotypes.

I've fixed a number of bugs in the writing of VCFs, but I haven't thoroughly reviewed that code and the test VCF I've added here is pretty simple.

@jameshadfield jameshadfield marked this pull request as ready for review December 12, 2023 04:10
@rneher
Copy link
Member

rneher commented Dec 12, 2023

thanks, @jameshadfield! I'll have a look. This code hasn't really been looked at in a long time and was written for rather specific needs in our early MTb work, so really appreciate some attention!

@rneher
Copy link
Member

rneher commented Dec 14, 2023

this all looks good to me. happy to merge it in and make a release.

@jameshadfield
Copy link
Contributor Author

this all looks good to me. happy to merge it in and make a release.

Thanks! I'm just doing some final checks comparing augur ancestral & augur translate using FASTA vs VCF inputs on the Cholera dataset so let's hold off on this for a few days just in case there are any more little fixes I need to add.

jameshadfield added a commit to nextstrain/augur that referenced this pull request Dec 21, 2023
The added test here fails under TreeTime 0.11.1 but has been fixed as
part of <neherlab/treetime#263>.

Description of the bug (as it relates to the test added in `vcf.t`):

The SNPs at nt 33 are encoded in the VCF as:
1	    33	.   A	C,G	.	    .	    .	    GT	    1	        2	        0
where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B.
The ALT 2 is not being parsed by `read_vcf`, which results in
a changed mutation profile at pos 33:
.       **FASTA input**               **VCF input**
.         |---G33C-- sample_A           |---A33C-- sample_A
.  --A33G-|                      -------|
.         |--------- sample_B           |--------- sample_B
.
Because of this bug, the following test fails.

The `read_vcf` function is used in augur commands ancestral, refine,
sequence-traits, translate and tree.
jameshadfield added a commit to nextstrain/augur that referenced this pull request Dec 21, 2023
Switches the `translate-with-gff-and-locus-tag.t` test to using the
same data as the corresponding `translate-with-gff-and-gene.t`, thus
testing _just_ the change in GFF syntax.

The replaced test used TB data which was problematic for a few reasons:
- The VCF file wasn't correctly formatted, with a mixture of haploid and
  diploid genotypes. TreeTime's `read_vcf` will error on this after
  <neherlab/treetime#263> is merged.
- The VCF encoded genotypes of '.' which were read as allele="N", however
  these were supposed to be reference bases (encoded as genotype="0").
  If we update the VCF then the aa_muts.json are very different. This
  speaks to a bigger problem with test data such as this - there is no
  assurance that the output data is correct. For this reason I prefer
  the "simple-genome" tests for which we can validate every mutation.
jameshadfield added a commit to nextstrain/augur that referenced this pull request Dec 21, 2023
This requires a TreeTime with
<neherlab/treetime#263>. The benefit is that the
produced VCF file will encode genotypes with the same ploidy as the
input, as well as using the same chromosome name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that PyVCF (or the more current PyVCF3) is not used to replace/supplement read_vcf for parsing efficiency reasons, but could it be useful for write_vcf?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks partly to your suggestion to use cProfile, the VCF writing is now down to a handful of seconds 58ab78d 🎉

jameshadfield added a commit to nextstrain/augur that referenced this pull request Dec 29, 2023
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
This is only used if provided, otherwise we fall back to the default
ploidy of diploid (unchanged) and chromosome name of "1" (changed from
the previously hardcoded "MTB_anc"). Note that the spec states that
no-call genotypes should be encoded to reflect the ploidy (e.g. "./."
for diploid and "." only for haploid) which is a change from previous
behaviour.

Closes #262
This is (partly) behind the bug described in
<nextstrain/augur#1380>
This (optional) ability allows the node-data JSON and VCF output from
`augur ancestral` to match.
This allows the VCF file to be parsed by bcftools. See
<#263 (comment)>
These were previously printed with ALT="" (and all genotype calls
indicating to use the reference allele).
The 2 main changes here are changing the const sites list to a set to
improve lookup time and pre-computing the positions at each variable
site in a more efficient manner.

For my test VCF (100k rows, 100 samples) `write_vcf` went from 95
seconds to 0.7 seconds. For the full 800k rows VCF, `write_vcf` went
from ~8 hours to 5.6 seconds (5000x speedup).

Times & profiling using cProfile

CLoses nextstrain/augur#1378
jameshadfield added a commit to nextstrain/augur that referenced this pull request Dec 30, 2023
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
@jameshadfield
Copy link
Contributor Author

@rneher I'm happy with the state of this now and don't plan to do any more work here. There were a few more commits added since you last saw it. Once this is merged and released I'll update nextstrain/augur#1355 to use the newest version.

@rneher rneher merged commit 5dbb1cb into master Jan 22, 2024
5 checks passed
@rneher rneher deleted the james/vcf branch January 22, 2024 08:50
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jan 22, 2024
The added test here fails under TreeTime 0.11.1 but has been fixed as
part of <neherlab/treetime#263>.

Description of the bug (as it relates to the test added in `vcf.t`):

The SNPs at nt 33 are encoded in the VCF as:
1	    33	.   A	C,G	.	    .	    .	    GT	    1	        2	        0
where ALT 1 ("C") is on Sample_A and ALT 2 ("G") is on Sample_B.
The ALT 2 is not being parsed by `read_vcf`, which results in
a changed mutation profile at pos 33:
.       **FASTA input**               **VCF input**
.         |---G33C-- sample_A           |---A33C-- sample_A
.  --A33G-|                      -------|
.         |--------- sample_B           |--------- sample_B
.
Because of this bug, the following test fails.

The `read_vcf` function is used in augur commands ancestral, refine,
sequence-traits, translate and tree.
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jan 22, 2024
Switches the `translate-with-gff-and-locus-tag.t` test to using the
same data as the corresponding `translate-with-gff-and-gene.t`, thus
testing _just_ the change in GFF syntax.

The replaced test used TB data which was problematic for a few reasons:
- The VCF file wasn't correctly formatted, with a mixture of haploid and
  diploid genotypes. TreeTime's `read_vcf` will error on this after
  <neherlab/treetime#263> is merged.
- The VCF encoded genotypes of '.' which were read as allele="N", however
  these were supposed to be reference bases (encoded as genotype="0").
  If we update the VCF then the aa_muts.json are very different. This
  speaks to a bigger problem with test data such as this - there is no
  assurance that the output data is correct. For this reason I prefer
  the "simple-genome" tests for which we can validate every mutation.
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jan 22, 2024
This requires a TreeTime with
<neherlab/treetime#263>. The benefit is that the
produced VCF file will encode genotypes with the same ploidy as the
input, as well as using the same chromosome name
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
jameshadfield added a commit to nextstrain/augur that referenced this pull request Jan 22, 2024
NOTE: This requires a corresponding change in TreeTime, included in PR
<neherlab/treetime#263>

Previously we were not calculating a mask for VCF files. This adds it
and applies it the VCF output.

See <#1382> and the parent
commit for more context.

Closes #1382 (together with the parent commit)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants