Too good to be true? understanding hapdup results #42
Hi @cahuparo, it does seem a bit too good to be true. In theory, if you have enough heterogeneity, you can phase entire chromosomes, but it's hard to believe there were no phasing errors. Hapdup does produce the phase block coordinates as output, so you can see how fragmented these are. You mentioned you are extracting k-mers from Illumina data: is it trio data (e.g. paternal and maternal sequencing)? Does the process of building the Meryl database have access to the hapdup assemblies? Best,
Hi @fenderglass, Thanks! I am glad I asked you about this. Here I am focusing on both the evaluation of phased block coordinates (1) and the creation and validation of k-mer databases (2).
(1) Evaluating Phased Block Coordinates
Upon inspecting the phased block coordinates produced by HapDup, we discovered entries with negative block sizes, indicating start positions greater than the end positions for some blocks. This was an unexpected finding that suggested potential data or processing issues, but I am not an expert on this, so I wanted to get your opinion:
Are these the phasing errors? Is it weird that there are only this many? Are these the result of inversion misassemblies or artefacts of the phasing process?
(2) Revisiting the entire approach, including the k-mer database creation
Yes, we used Illumina data for that specific strain. No, we don't have trio data. Here is a more detailed explanation of my approach:
Step 1: Genome Assembly with Canu
Step 2: Purging Duplicates
Step 3: Generate a Database of Repetitive Elements AND Polishing with NextPolish2
NextPolish2 is used for genome assembly polishing to improve assembly quality using both long reads (ONT) and short reads (Illumina).
Generate a Database of Repetitive Elements: This step helps in optimizing the mapping of reads, especially in repetitive regions of the genome:
Polishing with NextPolish2: Use the repetitive elements information during read mapping for polishing.
Identifying and managing repetitive elements before polishing can significantly reduce the chances of misassembly or errors in regions with high sequence similarity. By informing the polishing process about these repetitive elements, NextPolish2 can more accurately use the read data for correcting the assembly, leading to a higher quality final genome sequence. "In theory"
Step 4: Scaffolding with RagTag
Step 5: Haplotype Resolution with HapDup
Step 6: Haplotype Assembly Evaluation with Merqury
The evaluation process involved the following steps:
Generating K-mer Databases: K-mer databases are generated for both the sequencing reads (ONT and PE) and the haplotyped assemblies using Meryl.
For Haplotyped Assemblies:
Merqury Analysis: Merqury performs quality evaluation by comparing the k-mer composition of the assemblies against the k-mer composition derived from sequencing reads. This comparison helps identify discrepancies and assess the quality and completeness of the assemblies.
Finally, run Merqury:
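For intuition, the k-mer comparison described above can be sketched in a few lines of Python. This is a toy illustration, not Merqury itself: the function names (`kmer_set`, `completeness`, `consensus_qv`) and the sequences are hypothetical, only distinct forward-strand k-mers are used (no canonical k-mers, no multiplicities, no filtering for reliable read k-mers), and the QV expression follows the k-mer survival formulation as I understand it from the Merqury publication.

```python
import math


def kmer_set(seq: str, k: int) -> set:
    """Distinct k-mers of a sequence (forward strand only; no canonical k-mers)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def completeness(asm_kmers: set, read_kmers: set) -> float:
    """Fraction of read k-mers that are also found in the assembly."""
    if not read_kmers:
        return 0.0
    return len(asm_kmers & read_kmers) / len(read_kmers)


def consensus_qv(asm_kmers: set, read_kmers: set, k: int) -> float:
    """Phred-scaled consensus quality from the share of assembly k-mers seen in reads.

    Assumed formulation: P = (shared / total) ** (1 / k), QV = -10 * log10(1 - P).
    """
    if not asm_kmers:
        raise ValueError("assembly has no k-mers")
    total = len(asm_kmers)
    shared = len(asm_kmers & read_kmers)
    if shared == total:
        return math.inf  # every assembly k-mer is supported by the reads
    if shared == 0:
        return 0.0
    p_correct = (shared / total) ** (1.0 / k)
    return -10.0 * math.log10(1.0 - p_correct)


# Hypothetical toy data: one short "assembly" and one read covering most of it.
K = 4
assembly = "ACGTACGGA"
reads = ["ACGTACGG"]

asm_kmers = kmer_set(assembly, K)
read_kmers = set().union(*(kmer_set(r, K) for r in reads))

toy_completeness = completeness(asm_kmers, read_kmers)
toy_qv = consensus_qv(asm_kmers, read_kmers, K)
```

In this toy case one assembly k-mer has no read support, which is what lowers the QV; a k-mer that is only in the assembly is treated as a likely consensus error.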
Thank you for your time and consideration in this complex yet fascinating endeavor towards understanding haplotype-resolved genome assemblies. Best, Camilo
Hi Camilo, Negative phased blocks are definitely unexpected - do you see those in a file in the Margin output dir that ends with |
Hi Misha, Please see the attached BED file, in txt format because the bed extension can't be attached. For some of those negative phased blocks:
What does that mean? Could you help me understand what is wrong and if the results can be trusted? Thanks for your time! Camilo
Hi Camilo, It seems like a potential issue with the Margin output, maybe some kind of borderline case. But it should not affect the assembly results, as long as the phasing stats look reasonable (e.g. phasing N50 of a few hundred kb to Mb, phased block length comparable to genome size).
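As a quick sanity check along these lines, the phased block file can be screened for blocks whose start is not below their end, and a phasing N50 computed from the remaining blocks. The sketch below assumes a plain BED-like layout (contig, start, end in the first three whitespace-separated columns); the function names and the example coordinates are hypothetical and are not taken from the attached file.

```python
from typing import List, Tuple

Block = Tuple[str, int, int]  # (contig, start, end)


def parse_bed(text: str) -> List[Block]:
    """Parse the first three BED columns, skipping blank and header/comment lines."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        blocks.append((contig, int(start), int(end)))
    return blocks


def split_valid(blocks: List[Block]) -> Tuple[List[Block], List[Block]]:
    """Separate well-formed blocks (start < end) from suspicious ones."""
    valid = [b for b in blocks if b[1] < b[2]]
    suspect = [b for b in blocks if b[1] >= b[2]]
    return valid, suspect


def n50(lengths: List[int]) -> int:
    """Largest length L such that blocks of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical example: three well-formed blocks and one with start > end.
example_bed = """\
ctg1\t0\t500000
ctg1\t500000\t1200000
ctg2\t900000\t899000
ctg2\t0\t300000
"""

valid, suspect = split_valid(parse_bed(example_bed))
phasing_n50 = n50([end - start for _, start, end in valid])
```

Listing the `suspect` blocks next to their neighbours on the same contig may help show whether they sit at block boundaries, which would be consistent with a borderline case rather than a phasing error.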
Hi @fenderglass,
I've adopted a strategy using Illumina reads alongside ONT R10 data to construct and evaluate phased genome assemblies. After assembling with canu, I followed with purged_dups, NextPolish2, and RagTag; lastly, I used hapdup to produce a dual assembly. I'm now in the process of quality assessment for these "dual" assemblies. This is a diploid genome of a highly heterozygous plant pathogen.
Workflow and Issue Description:
My workflow integrates meryl to derive unique hap-mer databases, followed by the generation of blob plots through Merqury. However, the resulting plots are exceptionally clean, leading to doubts about the phasing precision and hap-mer authenticity.
Approach for Hap-mer Creation and Blob Plot Generation:
Hap-mer Creation:
meryl to count k-mers from Illumina data to create a k-mer database.
meryl difference, producing hap1_unique.meryl and hap2_unique.meryl.
meryl intersect to find overlaps with PE reads, creating a hap1_pe_intersect.meryl (similarly for hap2).
meryl union-sum to get a combined set of hap-mer databases.
Blob Plot Creation:
Merqury was run to assess the quality of the dual assemblies.
The concern arises when the blob plots show an overly distinct separation of hap-mers, potentially indicating over-filtering or other issues in the hap-mer generation pipeline.
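The set logic behind the hap-mer steps listed above can be mimicked on toy sequences. This is only an analogy in Python, not meryl: the helper names, the sequences, and the assumption that the difference step compares the two haplotype assemblies against each other are all hypothetical, and k-mers are taken from the forward strand only.

```python
from collections import Counter


def kmers(seq: str, k: int) -> Counter:
    """Count all k-mers of a sequence (forward strand only, for simplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def difference(a: Counter, b: Counter) -> Counter:
    """k-mers present in a but absent from b (loose analogue of a difference step)."""
    return Counter({kmer: n for kmer, n in a.items() if kmer not in b})


def intersect(a: Counter, b: Counter) -> Counter:
    """k-mers present in both inputs, keeping the count from a."""
    return Counter({kmer: n for kmer, n in a.items() if kmer in b})


def union_sum(a: Counter, b: Counter) -> Counter:
    """k-mers present in either input, with counts added together."""
    return a + b


# Hypothetical toy haplotypes that differ at a single base, plus two short reads.
K = 4
hap1 = kmers("ACGTACGGA", K)
hap2 = kmers("ACGTACTGA", K)
pe_reads = kmers("ACGTACGG", K) + kmers("GTACTGA", K)

hap1_unique = difference(hap1, hap2)
hap2_unique = difference(hap2, hap1)
hap1_pe_intersect = intersect(hap1_unique, pe_reads)
hap2_pe_intersect = intersect(hap2_unique, pe_reads)
combined = union_sum(hap1_pe_intersect, hap2_pe_intersect)

# The two hap-mer sets cannot share a k-mer, whatever the reads contain.
overlap = set(hap1_pe_intersect) & set(hap2_pe_intersect)
```

If the hap-mers are derived this way, each set is exclusive to the assembly it came from by construction, so a very clean separation in the blob plot may be partly expected rather than independent evidence of correct phasing.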
Concern:
The unusually distinct separation of hap-mers in the blob plots makes me think that it is "too good to be meaningful", indicating potential over-filtering or errors in the hap-mer derivation.
Request for Input:
I'm reaching out for guidance on deciphering these blob plots and for suggestions on extra validation procedures. I'm particularly interested in any methodological alterations that could mitigate over-filtering or adapt to the unavailability of parental genomes.
Specific Questions:
Thank you for your time and consideration,
Camilo
PS. Here are the commands I used to run hapdup:
Here are some stats of the assemblies from hapdup:
And these are some quick syntenic relationships between dual hap1 and hap2: