Issue assembling plant genome with NECAT #47

Open
LeoVincenzi opened this issue Jan 11, 2023 · 4 comments

Comments

@LeoVincenzi

Hi,
I'm working on a plant genome and I'm trying to assemble it with NECAT, but the final assembly I obtain is far from what I expect.
The expected genome size is 1.2 Gbp and I'm working with Oxford Nanopore reads. The input data for the assembly are summarized below:

Number of reads: 1,341,399
Number of bases (bp): 33,136,270,559
Average read length (bp): 24,703
Reads N50 (bp): 40,677
Expected fold-coverage: 28x
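
As a quick sanity check on the figures above, the expected fold-coverage is the total number of sequenced bases divided by the genome size. The sketch below only uses numbers reported in this issue; the function names are illustrative and not part of NECAT.

```python
# Figures copied from the read statistics reported in this issue.
num_reads = 1_341_399
total_bases = 33_136_270_559
genome_size = 1_200_000_000  # GENOME_SIZE from the config (1.2 Gbp)


def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Expected fold-coverage = sequenced bases / genome size."""
    return total_bases / genome_size


def mean_read_length(total_bases: int, num_reads: int) -> float:
    """Average read length = sequenced bases / number of reads."""
    return total_bases / num_reads


coverage = fold_coverage(total_bases, genome_size)
avg_len = mean_read_length(total_bases, num_reads)

print(f"coverage: {coverage:.1f}x")
print(f"mean read length: {avg_len:,.0f} bp")
```

Both values agree with the reported 28x and 24,703 bp after rounding.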

The obtained results are the following:

Assembly statistics (NECAT v.0.0.1):
Total assembly size (bp): 604,869
Num. Contigs: 12
Contigs average length (bp): 50,406
N50 (bp): 153,041
N90 (bp): 17,942
Longest contig (bp): 154,607
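
For readers unfamiliar with the contiguity metrics above, here is a minimal sketch of how N50 and N90 are commonly computed from a list of contig lengths, plus the fraction of the expected genome that this assembly recovers. The contig lengths in `toy` are made up for illustration; the individual contig lengths of this assembly were not posted.

```python
def nxx(lengths, fraction):
    """Return the Nxx value: the length L such that contigs of length >= L
    together contain at least `fraction` of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    # Walk from the longest contig down until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical contig lengths, for illustration only (not from this assembly).
toy = [10, 8, 6, 4, 2]
toy_n50 = nxx(toy, 0.5)
toy_n90 = nxx(toy, 0.9)
print(toy_n50, toy_n90)

# Fraction of the expected 1.2 Gbp genome covered by the reported assembly size.
recovered = 604_869 / 1_200_000_000
print(f"recovered: {recovered:.4%}")
```

The recovered fraction is well under 1% of the expected genome size, which is why the N50 reported here says little about the quality of the assembly.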

The command I ran was:
/opt/NECAT/Linux-amd64/bin/necat.pl assemble config.txt
and the config file was filled in as follows:

PROJECT=Plant_genome
ONT_READ_LIST=read_list.txt
GENOME_SIZE=1200000000
THREADS=15
MIN_READ_LENGTH=3000
PREP_OUTPUT_COVERAGE=28
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=1
CNS_OUTPUT_COVERAGE=28
CLEANUP=1
USE_GRID=true
GRID_NODE=8
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true

I would like to understand why the resulting assembly is so poor and how I can improve it. Could the parameters used for this dataset be inadequate?

@lemene

lemene commented Jan 24, 2023

Hi,
28X is less than the Nanopore read coverage that NECAT expects (>=40X). This affects the integrity of the assembly. Using the following parameter may improve the assembly:
FSA_OL_FILTER_OPTIONS=--min_coverage 2
The value 2 can be replaced by 1 or 3.
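
One way to apply this change without hand-editing the file is sketched below. It treats the config as plain `KEY=VALUE` lines, as in the config posted above; `set_config_option` is a hypothetical helper written for this example, not a NECAT utility.

```python
def set_config_option(config_text: str, key: str, value: str) -> str:
    """Return config_text with `key=value`, replacing the existing line
    for `key` if there is one, otherwise appending a new line."""
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# A shortened excerpt of the config from this issue, used as example input.
example = "PROJECT=Plant_genome\nFSA_OL_FILTER_OPTIONS=\nFSA_ASSEMBLE_OPTIONS=\n"
updated = set_config_option(example, "FSA_OL_FILTER_OPTIONS", "--min_coverage 2")
print(updated)
```

Trying the values 1, 2 and 3 in turn then only requires changing the last argument.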

The folders 4-fsa, 5-align_contigs and 6-bridge_contigs need to be renamed or deleted before running the command necat.pl bridge cfgfile. This will skip the error correction step and reassemble the corrected reads.
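
The renaming step can be scripted, for example as in the sketch below. It assumes that the three folders sit directly inside the directory you pass in; where NECAT actually places them in your project is not stated in this thread, so check your output tree first. The `.old` suffix and the helper name are arbitrary choices for this example, and the demo runs on a throwaway temporary directory only.

```python
import tempfile
from pathlib import Path

# Folder names taken from the comment above.
STALE_DIRS = ("4-fsa", "5-align_contigs", "6-bridge_contigs")


def set_aside_stale_dirs(project_dir, suffix=".old"):
    """Rename the listed folders (if present) so that they are regenerated
    on the next run. Returns the list of renamed paths."""
    project = Path(project_dir)
    moved = []
    for name in STALE_DIRS:
        src = project / name
        if src.is_dir():
            dst = project / (name + suffix)
            if dst.exists():
                raise FileExistsError(f"{dst} already exists; pick another suffix")
            src.rename(dst)
            moved.append(dst)
    return moved


# Demo on a temporary directory, so no real results are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in STALE_DIRS:
        (Path(tmp) / name).mkdir()
    demo_moved = [p.name for p in set_aside_stale_dirs(tmp)]

print(demo_moved)
```

After the folders have been set aside, run necat.pl bridge cfgfile as described above.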

@LeoVincenzi
Author

Hi,
thanks to your suggestion, we ended up with an assembly of the expected size and with a high N50 value. I also noticed that the 'bridge' step improves contiguity, doubling the N50 obtained from the 'assemble' step.
I would also like to ask how the parameter FSA_OL_FILTER_OPTIONS affects the assembly: I suppose it is involved in handling the overlapping regions, but if we start from a high coverage (40x), why should we set the minimum coverage to such a low value (1, 2, 3, ...)?

@lemene

lemene commented Mar 3, 2023

Hi @LeoVincenzi
The assembler calculates the coverage of each read. If the coverage is less than the threshold min_coverage, the read and the related overlaps are filtered out. The assembler can automatically calculate a value for it, but sometimes it is not appropriate. According to our experience, min_coverage = 3 is not a bad choice.
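
The filtering rule described above can be illustrated with a toy example. This is only a sketch of the stated rule, not the assembler's actual code: how the per-read coverage is computed is not explained in this thread, so the coverage values are simply given as input, and all read names and numbers are invented.

```python
def filter_by_min_coverage(read_coverage, overlaps, min_coverage):
    """Drop reads whose coverage is below min_coverage, together with
    every overlap that involves a dropped read."""
    kept_reads = {read for read, cov in read_coverage.items() if cov >= min_coverage}
    kept_overlaps = [(a, b) for a, b in overlaps if a in kept_reads and b in kept_reads]
    return kept_reads, kept_overlaps


# Invented per-read coverage values and overlaps, for illustration only.
read_coverage = {"r1": 5, "r2": 1, "r3": 3}
overlaps = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]

kept_reads, kept_overlaps = filter_by_min_coverage(read_coverage, overlaps, min_coverage=2)
print(sorted(kept_reads))
print(kept_overlaps)
```

In this toy input, r2 falls below the threshold, so it and both overlaps that involve it are removed. The threshold is applied per read, which is why it can be small even when the overall coverage of the dataset is high.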

@lemene

lemene commented Mar 3, 2023

Some raw reads are broken into multiple corrected reads during the error correction step. The unbroken raw reads are used to bridge the contigs, which is why the assembler can output contigs with a longer N50 after the 'bridge' step.
