Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joint Genotype Calling Recalibrated VCFs Duplicate Entries; VEP is (Intentionally) Eating Variants #966

Closed
evanwu1119 opened this issue Mar 16, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@evanwu1119
Copy link

evanwu1119 commented Mar 16, 2023

Hi all, I encountered some unexpected results when testing the GATK joint genotype calling pipeline on my dataset.

  1. The number of variants in the final recalibrated VCF (joint_germline_recalibrated.vcf.gz) is exactly double that in the raw VCF (joint_germline.vcf.gz) outputted from GenotypeGVCFs. This is because all variants are completely duplicated due to merging separately recalibrated SNP and indel VCFs, the entries are also not identical as shown below. The fix is to sequentially recalibrate for SNPs then indels, or vice versa, on the same VCF.

Raw VCF counts and example entries:

$ bcftools +counts joint_germline.vcf.gz
Number of samples: 15
Number of SNPs:    13717272
Number of INDELs:  3503929
Number of MNPs:    0
Number of others:  0
Number of sites:   17113477

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
1       10146   rs375931351     AC      A       102.67  .       AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223     GT:AD:DP:GQ:PL
1       10327   rs112750067     T       C       106.77  .       AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS

Recalibrated VCF counts and example entries:

$ bcftools +counts joint_germline_recalibrated.vcf.gz
Number of samples: 15
Number of SNPs:    27434544
Number of INDELs:  7007858
Number of MNPs:    0
Number of others:  0
Number of sites:   34226954

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
1       10146   rs375931351     AC      A       102.67  PASS    AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=14.67;ReadPosRankSum=0.967;SOR=0.223;VQSLOD=-9.728e-01;culprit=DP        GT:AD:DP:GQ:PL
1       10146   rs375931351     AC      A       102.67  .       AC=3;AF=0.1;AN=30;BaseQRankSum=0.967;DB;DP=698;ExcessHet=0;FS=0;InbreedingCoeff=0.5022;MLEAC=2;MLEAF=0.067;MQ=40.97;MQRankSum=0.967;QD=14.67;ReadPosRankSum=0.967;SOR=0.223     GT:AD:DP:GQ:PL
1       10327   rs112750067     T       C       106.77  .       AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;QD=25.36;SOR=2.833    GT:AD:DP:GQ:PGT:PID:PL:PS
1       10327   rs112750067     T       C       106.77  VQSRTrancheSNP99.90to100.00     AC=2;AF=0.067;AN=30;DB;DP=280;ExcessHet=0;FS=0;InbreedingCoeff=0.214;MLEAC=4;MLEAF=0.133;MQ=30.71;NEGATIVE_TRAIN_SITE;POSITIVE_TRAIN_SITE;QD=25.36;SOR=2.833;VQSLOD=-1.242e+00;culprit=SOR      GT:AD:DP:GQ:PGT:PID:PL:PS
  1. I used the --merge annotation method and the final annotated VCF (joint_germline_recalibrated_snpEff_VEP.ann.vcf.gz) is missing almost half of the variants compared to the un-annotated VCFs, which was pretty alarming. After tracing the changes to the VCF it appears that VEP is intentionally filtering out common variants via the --filter_common external argument, which removes alleles with global AF > 0.01. While this is not a bug per-se, it would be helpful to make this an optional argument or at least document it somewhere to prevent scares and confusion.

Variant counts following through the annotation pipeline:

$ bcftools +counts joint_germline_snpEff.ann.vcf.gz
Number of samples: 15
Number of SNPs:    13717272
Number of INDELs:  3503929
Number of MNPs:    0
Number of others:  0
Number of sites:   17113477

$ bcftools +counts joint_germline_snpEff_VEP.ann.vcf.gz
Number of samples: 15
Number of SNPs:    5954775
Number of INDELs:  3370807
Number of MNPs:    0
Number of others:  0
Number of sites:   9223960
@FriederikeHanssen FriederikeHanssen added bug Something isn't working docs enhancement New feature or request labels Mar 21, 2023
@FriederikeHanssen
Copy link
Contributor

Hey @evanwu1119 ! Thanks for reporting and investigating the issues. We can for sure add a parameter to prevent this. However in the meantime, you can also add a custom config for VEP that removes the filter_common parameter

@FriederikeHanssen
Copy link
Contributor

@amizeranschi @asp8200 this sounds relevant for your case.

@asp8200
Copy link
Contributor

asp8200 commented Jun 14, 2023

This issue should be solved by #1102

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

3 participants