Skip to content

Alignment and Hit Call Info

Katrina Kalantar edited this page Feb 26, 2020 · 6 revisions

Q: Why might I be seeing many reads to NR but few reads to NT?

A: Especially for viral mutations, you may see many NR matches (still generating matched proteins) but no NT matches.

Q: Why might I be seeing many reads to NT but few reads to NR?

You may notice the unusual profile of, say, 100K+ reads aligning to a virus by NT and nothing by NR. This can be explained by all these reads mapping to a non-coding region of a virus (which is a little weird, but a real result). In this example, you may see a separate set of hits with opposite pattern (0-very low NT aligned reads & crazy high # reads aligned to NR) but this is only seen on IDseq report page (for both reads and contig results). The contig summary file indicates large contigs with high % sequence identity at nt level and high % contig covered. In these cases you may want to focus on the one with a significantly lower e-value (indicating higher significance). [Thanks AK]

Q: Is there documentation somewhere for what IDseq calls a particular read a hit in NT - i.e. what percent identity and over how long of a stretch of sequence?

A: We don't set constraints on the percent identity or alignment lengths other than alignment_length > 36, -0.25 < percent_identity < 100.25 (https://github.com/chanzuckerberg/idseq-dag/blob/master/idseq_dag/util/m8.py#L388) on our own code. Upstream the m8 file is generated in the alignment stage by GSNAP. Our GSNAP command is:

bin/gsnapl -A m8 --batch=0 --use-shared-memory=0 --gmap-mode=none --npaths=100 --ordered -t 36 --max-mismatches=40 -D {remote_index_dir} -d nt_k16 (https://github.com/chanzuckerberg/idseq-dag/blob/master/idseq_dag/steps/run_alignment_remotely.py#L238) so other params are at their defaults (https://wiki.gacrc.uga.edu/wiki/Gmap-gsnap).

In the new assembly steps, we use SPADES to assemble and BLAST to re-map the contigs:

spades.py -1 {input_fasta} -2 {input_fasta2} -o {assembled_dir} -m {memory} -t 32 --only-assembler https://github.com/chanzuckerberg/idseq-dag/blob/master/idseq_dag/steps/run_assembly.py#L30

and {blast_command} -query {assembled_contig} -db {blast_index_path} -out {blast_m8} -outfmt 6 -num_alignments 5 -num_threads 32 (https://github.com/chanzuckerberg/idseq-dag/blob/master/idseq_dag/steps/blast_contigs.py#L184)