Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duphold with Nanopore data? #45

Open
AnneBoshove opened this issue Jan 14, 2022 · 9 comments
Open

Duphold with Nanopore data? #45

AnneBoshove opened this issue Jan 14, 2022 · 9 comments

Comments

@AnneBoshove
Copy link

Hi,

I used duphold on some VCFs containing SVs called with Sniffles where the alignment was done using minimap2. The data used is all nanopore longread data. I was wondering if duphold is supposed to work properly on long read data? As quite a bit of the results seemed off/incorrect with very high numbers. Here's a sample of some of the results (parsed to obtain the info relevant for me):

SVtype | #CHROM | POS | End | SVLength | Precision | RE | Genotype | Consequence | GeneSymbol | EnsemblID | DHFC | DHFFC | DHBFC

DEL | Chr3 | 21648500 | 21946887 | -298387 | IMPRECISE | 10 | 01/Jan | coding_sequence_variant&5_prime_UTR_variant&intron_variant&feature_truncation | ARHGAP17 | ENSSSCG00000031488 | 181.481 | 101.031 | 178.182
DEL | Chr5 | 90902052 | 90907562 | -5510 | PRECISE | 39 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | CDK17 | ENSSSCG00000028182 | 0.963636 | 0.386861 | 0.946429
DEL | Chr2 | 122165466 | 122173123 | -7657 | PRECISE | 31 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | COMMD10 | ENSSSCG00000014223 | 161.818 | 143.548 | 161.818
INS | Chr6 | 13229037 | 13229037 | 46 | PRECISE | 21 | 0/1 | coding_sequence_variant&feature_elongation | GLG1 | ENSSSCG00000030420 | 0.814815 | 1 | 0.814815
DEL | Chr10 | 66054310 | 66056761 | -2451 | PRECISE | 62 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | ITIH5 | ENSSSCG00000011129 | 0.909091 | 0.438596 | 0.925926
DEL | Chr2 | 28721000 | 28772243 | -51243 | PRECISE | 69 | 0/1 | coding_sequence_variant&intron_variant&feature_truncation | KIAA1549L | ENSSSCG00000013311 | 143.636 | 0.153696 | 143.636
INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 130 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815
INV | Chr2 | 28705133 | 28707047 | 1914 | PRECISE | 143 | 0/1 | coding_sequence_variant&intron_variant | KIAA1549L | ENSSSCG00000013311 | 112.727 | 0.136564 | 114.815
DEL | ChrX | 90375498 | 90376075 | -577 | PRECISE | 232 | 01/Jan | coding_sequence_variant&intron_variant&feature_truncation | MID2 | ENSSSCG00000012564 | 117.857 | 0.103774 | 117.857
INS | Chr13 | 135062890 | 135063265 | 1039 | IMPRECISE | 20 | 0/1 | coding_sequence_variant&intron_variant&feature_elongation | MUC13 | ENSSSCG00000011862 | 0.75 | 0.976744 | 0.792453

Thanks in advance,

Kind regards,
Anne

@brentp
Copy link
Owner

brentp commented Jan 14, 2022

Hi, yes, duphold should work very well on long read data.
a lot of those do seem extreme. How deep is the coverage for this sample?
I would look at a few of them in IGV.
If this is the worst of thousands of variants, it doesn't seem too surprising and is probably because of weird regions (I see MUC13 in there, for example). If this is a random subset of variants or representative of the results, then something else must be going on.

@AnneBoshove
Copy link
Author

AnneBoshove commented Jan 14, 2022

Thanks for your reply @brentp

The coverage should be around 120X as 3 flowcells were used in the sequencing.
Unfortunately this is just a random sample so not the worst, i've seen several values around 300/400, the highest i've seen was around 900. Values around 100 are quite common in my output. Looking at the results in a genome browser is a bit difficult for me as my device isn't the greatest and struggles with loading/displaying such large files, but I will try.

Could it be that something is off about the format of my VCF that I used as input for duphold? They were ran through sniffles and then VEP, i'll give a few sample lines:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  LWlongreadsMD.sorted.bam
Chr1    267634  0       N       AGTGTCTCATGTCCGGGTCCGTATCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCCACGTTGTCTCCATGTCCCGTGTCCGTGTCTCATGTCGGCGTCCGTGTCTCATGTCCTGCTCGTGTCCGTGTCTCAGGTCTGGGTCCTGTGTCCGGGTCCGTGTCCCGTGTCCCGTGTCCGTGTCCTGTCCACAGTGTCCGTGTCTCATGTCCGGGTCCCGTGTCCGTGTCCCATGTCCATGTCCAG    .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267781;STD_quant_start=12.555591;STD_quant_stop=74.402381;Kurtosis_quant_start=-1.235287;Kurtosis_quant_stop=-1.844707;SVTYPE=INS;SUPTYPE=AL;SVLEN=246;STRANDS=+-;STRANDS2=19,10,19,10;RE=29;REF_strand=36,21;Strandbias_pval=1;AF=0.337209;CSQ=insertion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1725|1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,insertion|intron_variant&non_coding_transcript_variant&feature_elongation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz| GT:DR:DV        0/1:57:29
Chr1    267767  1       TGGGTCCTGTGTCCGTGTCCCATGTCCATGTCCCGTGTCCGTGTCTCAGGT     N       .       PASS    PRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=267818;STD_quant_start=3.683942;STD_quant_stop=3.693624;Kurtosis_quant_start=0.301027;Kurtosis_quant_stop=0.380196;SVTYPE=DEL;SUPTYPE=AL;SVLEN=-51;STRANDS=+-;STRANDS2=17,11,17,11;RE=28;REF_strand=36,21;Strandbias_pval=1;AF=0.329412;CSQ=deletion|downstream_gene_variant|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000070783|lncRNA|||||||||||1858|1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000072648|lncRNA||1/1||||||||||1||||LW.gtf.gz|,deletion|intron_variant&non_coding_transcript_variant&feature_truncation|MODIFIER||ENSSSCG00000041881|Transcript|ENSSSCT00000091536|lncRNA||1/1||||||||||1||||LW.gtf.gz|  GT:DR:DV        0/1:57:28
Chr1    294446  2       GTGTGCGTGTGTTGTGTGTGGGGTGTGTGTGTGTGTGGGGTGTGTGTGTGGTGTGTGTGTTGTGTGTTTGTGGTGTGTGTGGTGTGTATGTG    N       .       PASS    IMPRECISE;SVMETHOD=Snifflesv1.0.12;CHR2=Chr1;END=294538;STD_quant_start=17.472056;STD_quant_stop=21.153121;Kurtosis_quant_start=-0.231708;Kurtosis_quant_stop=0.206432;SVTYPE=DEL;SUPTYPE=AL,NR;SVLEN=-92;STRANDS=+-;STRANDS2=17,5,17,5;RE=22;REF_strand=33,19;Strandbias_pval=0.288655;AF=0.297297;CSQ=deletion|intergenic_variant|MODIFIER||||||||||||||||||||||      GT:DR:DV        0/0:52:22

@brentp
Copy link
Owner

brentp commented Jan 14, 2022

Hi, perhaps you can plot a few of these with samplot?
If you show the result here, I might be able to tell more what's going on.

@AnneBoshove
Copy link
Author

I managed to look at the data in JBrowse and looked at a few of the SVs with high numbers;

afbeelding

afbeelding

This is the following DUP:

DUP | Chr6 | 165030128 | 165048325 | 18197 | IMPRECISE | 149 | 0/1 | transcript_amplification | ENSSSCG00000003889
DHFC: 962.963 | DHFFC: 597.701 | DHBFC: 981.132

And here's some deletions

DEL Chr13 73763246 73770714 -7468 PRECISE 34 0/1 transcript_ablation ENSSSCG00000045067
DHFC: 310.714 DHFFC: 263.636 DHBFC: 316.364

afbeelding

DEL Chr2 122165466 122173123 -7657 PRECISE 31 0/1 coding_sequence_variant&intron_variant&feature_truncation COMMD10 ENSSSCG00000014223 161.818 143.548 161.818

afbeelding

Only thing that seems a bit strange to me it that those deletions have a higher coverage, I don't see what it could be about the duplication that gives it such high values though.

@brentp
Copy link
Owner

brentp commented Jan 14, 2022

Yes, something is wrong. Would be nice to see these in IGV or samplot. If you can share one chromosome (bam,vcf) then I can have a look.

@AnneBoshove
Copy link
Author

I was having trouble attaching the files here, so i've send them to you in the mail. Maybe good to know that this alignment is done by mapping long reads against an assembly that was made with the same long reads, to find heterozygous SVs. Thanks for all the help!

@brentp
Copy link
Owner

brentp commented Jan 18, 2022

thanks, can you also send the reference fasta via that service?

@AnneBoshove
Copy link
Author

Yes, you should have received it

@brentp
Copy link
Owner

brentp commented Jan 18, 2022

got it. will go through this in next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants