Skip to content

Latest commit

 

History

History
159 lines (123 loc) · 9.83 KB

MarginPolish.md

File metadata and controls

159 lines (123 loc) · 9.83 KB

MarginPolish

MarginPolish is a diploid-aware assembly polisher. It takes as input a FASTA assembly and an indexed BAM (ONT reads aligned to the assembly), and it produces a haploid or diploid polished FASTA assembly. While Margin's polish submodule serves as a standalone assembly polisher, it is also part of an assembly pipeline which includes an ultrafast nanopore assembler Shasta and a multi-task RNN polisher HELEN. HELEN operates on images generated by Margin to generate a polished haploid sequence.

Functional Overview

Margin's polish submodule takes reads and alignments from the BAM and generates an initial graph describing matches, inserts, and deletes observed in the alignments. It aligns each read to this graph and determines the probabilities for multiple likely alignments. After all reads are aligned, it generates weighted alignment scores for each node in the graph and uses these to determine a most-likely path through the graph. This path becomes the inital graph for the next iteration of the process. Iteration continues until the total weighted likelihood of alignments decreases between iteration steps, or until a configured maximum iteration count is reached.

All of these alignments are done in run-length space, which simplifies and cleans the alignments. This reduces effects from errors in homopolymer runs, which are the primary source of error in nanopore reads. After determining a most-likely run-length-encoded sequence, it expands the run-lengths to generate a final consensus sequence. This expansion is done using a Bayesian model which predicts the most likely true run-length from all run-length observations in the reads aligned to the node.

In diploid polishing mode, Margin identifies candidate sites where variants may be present (alternatively, a VCF can be supplied and the tool will use these sites instead of its own detection strategy). It finds read substrings aligned around these sites, uses all unique read substrings as candidate alleles, and aligns all read substrings to all candidate alleles. These alleles and the read alignments to them are used to construct a graph containing read linkage information between sites. The phasing HMM is applied to this graph, and we use this to identify the most likely alleles at each site and the haplotype to which they belong. After all chunks are complete, Margin stitches haplotypes together using a set similarity metric between reads assigned to the adjacent chunks' haplotypes, and stitches sequences together using the haploid stitching method.

Assembly Workflow

For a detailed description of the end-to-end assembly workflow, see the documentation provided in the HELEN repository.

Sample Execution

# haploid mode
./margin polish \
        /path/to/margin/tests/data/realData/HG002.r94g360.chr20_59M_100k.bam \
        /path/to/margin/tests/data/realData/hg38.chr20_59M_100k.fa \
        /path/to/margin/tests/data/realData/params/ont/r9.4/allParams.np.human.r94-g360.json
        
# diploid mode
./margin polish \
        /path/to/margin/tests/data/realData/HG002.r94g360.chr20_59M_100k.bam \
        /path/to/margin/tests/data/realData/hg38.chr20_59M_100k.fa \
        /path/to/margin/params/ont/r9.4/allParams.np.human.r94-g360.json \
        --diploid
        -v /path/to/margin/tests/data/realData/HG002.r94g360.chr20_59M_100k.vcf

Running MarginPolish

Full parameter set can be found from the help message:

$ ./margin polish
usage: margin polish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options]
Version: 2.0

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:
    BAM_FILE is the alignment of reads to the assembly (or reference).
    ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.
    PARAMS is the file with marginPolish parameters.

Default options:
    -h --help                : Print this help screen
    -a --logLevel            : Set the log level [default = info]
    -t --threads             : Set number of concurrent threads [default = 1]
    -o --outputBase          : Name to use for output files [default = 'output']
    -r --region              : If set, will only compute for given chromosomal region
                                 Format: chr:start_pos-end_pos (chr3:2000-3000)
    -p --depth               : Will override the downsampling depth set in PARAMS
    -k --tempFilesToDisk     : Write temporary files to disk (for --diploid or supplementary output)

Diploid options:
    -2 --diploid             : Will perform diploid phasing.
    -v --vcf                 : VCF with sites for phasing (will not perform variant detection if set)
    -S --skipFilteredReads   : Will NOT attempt to haplotype filtered reads (--diploid only)
    -R --skipRealignment     : Skip realignment (for haplotyping only)
    -A --onlyVcfAlleles      : Only use alleles specified in the VCF. Requires NO RLE and 
                                 Requires NO RLE and --skipOutputFasta

HELEN feature generation options:
    -f --produceFeatures     : output splitRleWeight or diploidRleWeight (based on -2 flag) features for HELEN
    -F --featureType         : output specific feature type for HELEN (overwrites -f).  Valid types:
                                 splitRleWeight:   [default] run lengths split into chunks
                                 channelRleWeight: run lengths split into per-nucleotide channels
                                 simpleWeight:     weighted likelihood from POA nodes (non-RLE)
                                 diploidRleWeight: [default] produces diploid features 
    -L --splitRleWeightMaxRL : max run length (for RLE feature types) 
                                 [split default = 10, channel default = 10]
    -u --trueReferenceBam    : true reference aligned to ASSEMBLY_FASTA, for HELEN
                               features.  Setting this parameter will include labels
                               in output.  If -2/--diploid is set, this parameter must
                               contain two comma-separated values

Miscellaneous supplementary output options:
    -c --supplementaryChunks : Write supplementary files for each chunk (in additon to writing
                               whole genome information)
    -d --outputPoaDot        : Write out the poa as DOT file (only done per chunk)
    -i --outputRepeatCounts  : Write out the repeat counts as CSV file
    -j --outputPoaCsv        : Write out the poa as CSV file
    -n --outputHaplotypeReads: Write out phased reads and likelihoods as CSV file (--diploid only)
    -s --outputPhasingState  : Write out phasing likelihoods as JSON file (--diploid only)
    -M --skipHaplotypeBAM    : Do not write out phased BAMs (--diploid only, default is to write)
    -T --skipOutputFasta     : Do not write out phased fasta (--diploid only, default is to write)

Configuration

Margin uses a JSON configuration file which contains model and runtime parameterization for:

  • Pairwise Alignment
  • Graph Alignment
  • Run Length Estimation
  • Chunking
  • Read/Depth Management

For polishing, the run-length estimation correctness is tied to the basecaller version that was used to generate the reads. We have trained models for the ONT r9.4 and r10.3 pores with multiple versions of the Guppy basecaller for human and microbial samples, provided in the ont folder. Additionally we have provided a model trained on PacBio HiFi reads at pacbio/hifi/allParams.hifi.json.

params/
├── base_params.json
├── ont
│   ├── r10.3
│   │   ├── allParams.np.human.r103-g3210.json
│   │   └── allParams.np.microbial.r103g324.json
│   └── r9.4
│       ├── allParams.np.human.r94-g235.json
│       ├── allParams.np.human.r94-g305.json
│       ├── allParams.np.human.r94-g344.json
│       ├── allParams.np.human.r94-g360.json
│       ├── allParams.np.microbial.r94-g305.json
│       └── allParams.np.microbial.r94-g344.json
└── pacbio
    └── hifi
        └── allParams.hifi.json

HELEN Image Generation

HELEN is a multi-task RNN polisher which operates on images produced by MarginPolish. The images summarize the state of the nodes in the alignment graph before run-length expansion. They include the weights associated with read observations aligned at each node.

If MarginPolish is configured to generate images (the -f option), it will output a single .h5 file for each thread.

MarginPolish produces different image types (used during development) which can be configured with the -F flag, but users must use the default type 'splitRleWeight' for the trained models HELEN provides.

To produce HELEN training images, run MarginPolish with the -u flag. This takes an argument of an indexed BAM alignment of the truth sequence to the assembly. MarginPolish will extract the alignments from this BAM for each analyzed chunk. If there is a single alignment at this location with a sequence that approximately matches the chunk's size, it is used to label the images for both nucleotide and run-length.

Resource Requirements

While comprehensive resource usage profiling has not been done yet, we find that memory usage scales linearly with thread count, read depth, and chunk size. For this reason, our default parameters downsample read depth to 64 and restrict chunk size to 100000 bases.

We found that 2GB of memory per thread is sufficient to run Margin's polish submodule in haploid mode on genome-scale assemblies and alignment.

Across 13 whole-genome runs, we averaged roughly 350 CPU hours per gigabase of assembled sequence.

© 2019 by Benedict Paten (benedictpaten@gmail.com), Trevor Pesout (tpesout@ucsc.edu)