Duphold with Nanopore data? #45

AnneBoshove · 2022-01-14T11:26:53Z

Hi,

I used duphold on some VCFs containing SVs called with Sniffles where the alignment was done using minimap2. The data used is all nanopore longread data. I was wondering if duphold is supposed to work properly on long read data? As quite a bit of the results seemed off/incorrect with very high numbers. Here's a sample of some of the results (parsed to obtain the info relevant for me):

DEL | Chr3 | 21648500 | 21946887 | -298387 | IMPRECISE | 10 | 01/Jan | coding_sequence_variant&5_prime_UTR_variant&intron_variant&feature_truncation | ARHGAP17 | ENSSSCG00000031488 | 181.481 | 101.031 | 178.182
DEL | Chr5 | 90902052 | 90907562 | -5510 | PRECISE | 39 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | CDK17 | ENSSSCG00000028182 | 0.963636 | 0.386861 | 0.946429
DEL | Chr2 | 122165466 | 122173123 | -7657 | PRECISE | 31 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | COMMD10 | ENSSSCG00000014223 | 161.818 | 143.548 | 161.818
INS | Chr6 | 13229037 | 13229037 | 46 | PRECISE | 21 | 0/1 | coding_sequence_variant&feature_elongation | GLG1 | ENSSSCG00000030420 | 0.814815 | 1 | 0.814815
DEL | Chr10 | 66054310 | 66056761 | -2451 | PRECISE | 62 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | ITIH5 | ENSSSCG00000011129 | 0.909091 | 0.438596 | 0.925926
DEL | Chr2 | 28721000 | 28772243 | -51243 | PRECISE | 69 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | KIAA1549L | ENSSSCG00000013311 | 143.636 | 0.153696 | 143.636
INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 130 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815
INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 143 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815
DEL | ChrX | 90375498 | 90376075 | -577 | PRECISE | 232 | 01/Jan | coding_sequence_variant&intron_variant&feature_truncation | MID2 | ENSSSCG00000012564 | 117.857 | 0.103774 | 117.857
INS | Chr13 | 135062890 | 135063265 | 1039 | IMPRECISE | 20 | 0/1 | coding_sequence_variant&intron_variant&feature_elongation | MUC13 | ENSSSCG00000011862 | 0.75 | 0.976744 | 0.792453

Thanks in advance,

Kind regards,
Anne

brentp · 2022-01-14T12:06:23Z

Hi, yes, duphold should work very well on long read data.
a lot of those do seem extreme. How deep is the coverage for this sample?
I would look at a few of them in IGV.
If this is the worst of thousands of variants, it doesn't seem too surprising and is probably because of weird regions (I see MUC13 in there, for example). If this is a random subset of variants or representative of the results, then something else must be going on.

AnneBoshove · 2022-01-14T12:15:23Z

Thanks for your reply @brentp

The coverage should be around 120X as 3 flowcells were used in the sequencing.
Unfortunately this is just a random sample so not the worst, i've seen several values around 300/400, the highest i've seen was around 900. Values around 100 are quite common in my output. Looking at the results in a genome browser is a bit difficult for me as my device isn't the greatest and struggles with loading/displaying such large files, but I will try.

Could it be that something is off about the format of my VCF that I used as input for duphold? They were ran through sniffles and then VEP, i'll give a few sample lines:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LWlongreadsMD.sorted.bam
Chr1    267634  0       N       AGTGTCTCATGTCCGGGTCCGTATCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCCACGTTGTCTCCATGTCCCGTGTCCGTGTCTCATGTCGGCGTCCGTGTCTCATGTCCTGCTCGTGTCCGTGTCTCAGGTCTGGGTCCTGTGTCCGGGTCCGTGTCCCGTGTCCCGTGTCCGTGTCCTGTCCACAGTGTCCGTGTCTCATGTCCGGGTCCCGTGTCCGTGTCCCATGTCCATGTCCAG    .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267781;STD_quant_start=12.555591;STD_quant_stop=74.402381;Kurtosis_quant_start=-1.235287;Kurtosis_quant_stop=-1.844707;SVTYPE=INS;SUPTYPE=AL;SVLEN=246;STRANDS=+-;STRANDS2=19,10,19,10;RE=29;REF_strand=36,21;Strandbias_pval=1;AF=0.337209;CSQ=insertion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1725|1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz| GT:DR:DV        0/1:57:29
Chr1    267767  1       TGGGTCCTGTGTCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCTCAGGT     N       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267818;STD_quant_start=3.683942;STD_quant_stop=3.693624;Kurtosis_quant_start=0.301027;Kurtosis_quant_stop=0.380196;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-51;STRANDS=+-;STRANDS2=17,11,17,11;RE=28;REF_strand=36,21;Strandbias_pval=1;AF=0.329412;CSQ=deletion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1858|1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz|  GT:DR:DV        0/1:57:28
Chr1    294446  2       GTGTGCGTGTGTTGTGTGTGGGGTGTGTGTGTGTGTGGGGTGTGTGTGTGGTGTGTGTGTTGTGTGTTTGTGGTGTGTGTGGTGTGTATGTG    N       .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=294538;STD_quant_start=17.472056;STD_quant_stop=21.153121;Kurtosis_quant_start=-0.231708;Kurtosis_quant_stop=0.206432;SVTYPE=DEL;SUPTYPE=AL,NR;SVLEN=-92;STRANDS=+-;STRANDS2=17,5,17,5;RE=22;REF_strand=33,19;Strandbias_pval=0.288655;AF=0.297297;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||||||||      GT:DR:DV        0/0:52:22

brentp · 2022-01-14T12:22:07Z

Hi, perhaps you can plot a few of these with samplot?
If you show the result here, I might be able to tell more what's going on.

AnneBoshove · 2022-01-14T13:38:44Z

I managed to look at the data in JBrowse and looked at a few of the SVs with high numbers;

This is the following DUP:

DUP | Chr6 | 165030128 | 165048325 | 18197 | IMPRECISE | 149 | 0/1 | transcript_amplification | ENSSSCG00000003889
DHFC: 962.963 | DHFFC: 597.701 | DHBFC: 981.132

And here's some deletions

DEL Chr13 73763246 73770714 -7468 PRECISE 34 0/1 transcript_ablation ENSSSCG00000045067
DHFC: 310.714 DHFFC: 263.636 DHBFC: 316.364

DEL Chr2 122165466 122173123 -7657 PRECISE 31 0/1 coding_sequence_variant&intron_variant&feature_truncation COMMD10 ENSSSCG00000014223 161.818 143.548 161.818

Only thing that seems a bit strange to me it that those deletions have a higher coverage, I don't see what it could be about the duplication that gives it such high values though.

brentp · 2022-01-14T16:13:15Z

Yes, something is wrong. Would be nice to see these in IGV or samplot. If you can share one chromosome (bam,vcf) then I can have a look.

AnneBoshove · 2022-01-17T14:21:18Z

I was having trouble attaching the files here, so i've send them to you in the mail. Maybe good to know that this alignment is done by mapping long reads against an assembly that was made with the same long reads, to find heterozygous SVs. Thanks for all the help!

brentp · 2022-01-18T11:48:01Z

thanks, can you also send the reference fasta via that service?

AnneBoshove · 2022-01-18T12:00:40Z

Yes, you should have received it

brentp · 2022-01-18T12:01:04Z

got it. will go through this in next few days.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duphold with Nanopore data? #45

Duphold with Nanopore data? #45

AnneBoshove commented Jan 14, 2022

brentp commented Jan 14, 2022

AnneBoshove commented Jan 14, 2022 •

edited by brentp

Loading

brentp commented Jan 14, 2022

AnneBoshove commented Jan 14, 2022

brentp commented Jan 14, 2022

AnneBoshove commented Jan 17, 2022

brentp commented Jan 18, 2022

AnneBoshove commented Jan 18, 2022

brentp commented Jan 18, 2022

Duphold with Nanopore data? #45

Duphold with Nanopore data? #45

Comments

AnneBoshove commented Jan 14, 2022

brentp commented Jan 14, 2022

AnneBoshove commented Jan 14, 2022 • edited by brentp Loading

brentp commented Jan 14, 2022

AnneBoshove commented Jan 14, 2022

brentp commented Jan 14, 2022

AnneBoshove commented Jan 17, 2022

brentp commented Jan 18, 2022

AnneBoshove commented Jan 18, 2022

brentp commented Jan 18, 2022

AnneBoshove commented Jan 14, 2022 •

edited by brentp

Loading