Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto get ref genes, variant handling #41

Merged
merged 166 commits into from
Feb 18, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
166 commits
Select commit Hold shift + click to select a range
cae568b
Move genetic_code into refcheck.Checker class
Dec 8, 2015
6555288
Add ref_genes_getter module
Dec 14, 2015
20a0868
Remove bs4 dependency as pip cannot find it
Dec 14, 2015
b166671
Add module amin_acid_variant
Dec 14, 2015
ceb891e
Require beautifulsoup4
Dec 16, 2015
00ea328
Rename amino_acid_variant to sequence_variant
Dec 17, 2015
d394cce
Rename agrees_with_protein_seq to sanity_check_against_seq
Dec 17, 2015
d9f72a6
Add sequence_metadata
Dec 17, 2015
8889943
Store variant types as strings; add trnaslate_seq option
Dec 17, 2015
9cb44da
Add __str__()
Dec 17, 2015
4e72a47
Add __eq__ and __le__
Dec 21, 2015
52ecebe
Add __eq__ and __le__
Dec 21, 2015
fe8a67b
Add reference_data module
Dec 21, 2015
e9173be
Add gene checking functionality
Dec 21, 2015
2f2d242
Use new reference_data instead of refcheck.Checker
Dec 21, 2015
c818156
Remove refcheck; make card download format match new sequence_metadat…
Dec 22, 2015
e925be6
Fix typo in filename
Dec 22, 2015
9641235
warn with metadata problem instead of die
Dec 23, 2015
c2f15d4
Add length info to bad genes error message
Dec 23, 2015
6b2127c
Track which genes removed, so clearer output for user
Dec 23, 2015
a16dc3e
Do not try to reverse complement genes
Dec 23, 2015
2ef8d2c
Tidy up log lines of removed sequences
Dec 23, 2015
f049b53
Add method make_catted_fasta
Dec 23, 2015
12ff539
Add test files for make_catted_fasta
Jan 4, 2016
1074d61
Change to use ReferenceData
Jan 4, 2016
b402ac8
Add method sequence_type
Jan 4, 2016
708451d
Add methods sequence and sequence_length
Jan 4, 2016
6bc452a
Add method has_variant
Jan 4, 2016
eda4a9f
Add method has_variant
Jan 7, 2016
7f6f1f9
Split Cluster into several classes. This breaks everything. Now fix i…
Jan 11, 2016
dba45c0
Typo fix
Jan 11, 2016
da84253
typo fix
Jan 11, 2016
fd32063
remove line for now, has syntax error
Jan 11, 2016
e17b14e
Tests pass
Jan 11, 2016
aec5949
Tests pass
Jan 11, 2016
d1f30d4
Add missing modules
Jan 11, 2016
d61a163
Remove unneeded test files
Jan 11, 2016
5b4dd21
Tests pass
Jan 11, 2016
18709f3
Tests pass
Jan 11, 2016
c9cc75c
Close file before raising error
Jan 11, 2016
3928db1
Tests pass
Jan 11, 2016
dfbd6e4
Fixes so counting reads works
Jan 25, 2016
fc6cb57
test_init_fail_files_missing passes
Jan 25, 2016
cf2236d
Remove commented notes
Jan 25, 2016
3f10883
Fix all_non_wild_type_variants
Jan 25, 2016
4ae6c57
Implement hash function
Jan 25, 2016
bf71527
Change output to be hash
Jan 25, 2016
f87a7b2
Use lt instead of le for sorting
Jan 26, 2016
d326d82
Use lt instead of le for sorting
Jan 26, 2016
636499e
Store in dicts for easy lookups later
Jan 26, 2016
d6c9c66
Store self.nucmer_snps_file
Jan 29, 2016
9a83d96
Add method get_variants
Jan 29, 2016
b352911
Rename _nucmer_hits_to_ref_coords nucmer_hits_to_ref_coords
Jan 29, 2016
a3d0317
Do not need bam file
Jan 29, 2016
b68834d
Rename _ref_cov_per_contig ref_cov_per_contig
Jan 29, 2016
40e6c15
Spelling typos fix
Feb 1, 2016
b1213dc
Catch when no contigs file written
Feb 1, 2016
1f1ab6e
New flag ref_seq_choose_fail
Feb 1, 2016
c6efde1
Add method get_depths_at_position
Feb 1, 2016
97d327b
Add tets files needed for previous commit
Feb 1, 2016
06ec1ec
nucmer_hits_to_ref_coords return dict; ref_covered_by_complete_contig…
Feb 1, 2016
bb7893b
Bug fixes when doing complete run
Feb 1, 2016
8ca1cfc
Rewrite of cluster, split into separate modules
Feb 1, 2016
3bb6013
Remove uneeded test files
Feb 1, 2016
2e6e03a
Renmove gene_ from flag
Feb 2, 2016
2cb917d
Do not give noncoding seqs the compelte_orf flag
Feb 2, 2016
e327698
Add to_string method
Feb 2, 2016
1708107
Return variant in a list, for consistency
Feb 2, 2016
6537605
Bug fixes making noncoding report lines
Feb 2, 2016
f25f081
Return nucletoide change string
Feb 2, 2016
86d68c6
bug fix reporting number of assembled bases
Feb 2, 2016
f303b8e
Report variant type for noncoding variants
Feb 2, 2016
be765bd
Test all combinations of noncoding snps
Feb 2, 2016
c411d47
Should be same as references.fa in cluster dir
Feb 2, 2016
d117be5
Add tests for noncoding indels
Feb 2, 2016
3269943
List elements should be intervals, not nucmer alignments
Feb 3, 2016
455255b
Add test full run presence absence cluster
Feb 3, 2016
6017e2d
remove keys when values empty
Feb 3, 2016
a8d46b4
Add tests for variants_only cluster runs
Feb 3, 2016
ed4a7a3
Remove unused imports
Feb 3, 2016
442d0e7
Simplify by not renaming, use -d option to cdhit instead
Feb 3, 2016
22ac5f4
New nethod cluster_with_cdhit
Feb 3, 2016
4781252
Write cluster allocation file
Feb 4, 2016
30b1aa3
rewrite started. tests pass. write_report fixed
Feb 4, 2016
47e5b1c
Tidy up getting files from argannot
Feb 4, 2016
dd4487a
Tidy up getting files from resfinder
Feb 4, 2016
5e3627d
Add author and date to citation
Feb 4, 2016
ce23995
Tidy up getting files from card
Feb 4, 2016
38209e4
More tidy up getting files from card
Feb 4, 2016
c835f19
New task getref
Feb 4, 2016
d775d86
Add method write_seqs_to_fasta
Feb 5, 2016
661d68f
Set filehandle to None so pickling works
Feb 5, 2016
09f5670
Smalt no longer used
Feb 5, 2016
a680dbe
Set filehandle to None so pickling works
Feb 5, 2016
4069101
Velvet no longer used
Feb 5, 2016
6fad31b
Small fixes so now whole thing runs
Feb 5, 2016
3db70ee
Update to reflect rewrite of clusters.py
Feb 5, 2016
a3896d9
Bug fix writing report
Feb 5, 2016
426057f
Use new smatools sort syntax
Feb 5, 2016
1ec9441
Add method nucmer_hit_containing_reference_position
Feb 9, 2016
9c9054f
Add method nucmer_hit_containing_reference_position
Feb 9, 2016
0d481c4
Add method nucleotide_range
Feb 9, 2016
0dbd4d2
Report read depths etc on variants wild both ref and assembly
Feb 10, 2016
0a457cf
Die if input fastas empty
Feb 10, 2016
89eefb7
typo bug fix in __lt__
Feb 10, 2016
cbe20ad
Pass samtools
Feb 10, 2016
8302284
Skip seqs that did not make cluster
Feb 10, 2016
ea27757
Be verbose, when applicable
Feb 10, 2016
154e4e3
Be verbose with cdhit
Feb 10, 2016
98ca5fd
Put sequence in dummy fa file to stop crashing
Feb 10, 2016
ef52e20
Add option to run test data
Feb 10, 2016
b0ce7db
Tidy up output to log file
Feb 10, 2016
00a5f2a
Remove unsorted bam
Feb 10, 2016
6570c00
Clean files
Feb 10, 2016
418d656
Check assembly dir exists before trying to delete it
Feb 11, 2016
7685639
unsorted bam already deleted
Feb 11, 2016
54ba630
Remove smalt
Feb 11, 2016
eb94d3f
Rewrite external_progs module
Feb 11, 2016
9bae022
New always_report column in metadata
Feb 11, 2016
9a7816d
Add new always_report column to metadata
Feb 11, 2016
5fe29b7
Add new always_report column to metadata
Feb 11, 2016
40cba58
Add new always_report column to metadata
Feb 11, 2016
95ad759
Add new always_report column to output tsv
Feb 11, 2016
a1f474e
Force getting always_report=Y variants
Feb 11, 2016
0476e8c
Add test for always_ouput variants
Feb 15, 2016
4f381cd
Always report variants_only, even when variant not found
Feb 15, 2016
a913ae0
Report variants_only, whenever any reads mapped
Feb 15, 2016
8558b4f
Update to report variants_only when variant not there
Feb 15, 2016
cc9752b
Write assembled ref seqs to fasta file
Feb 15, 2016
27271c7
Remove commented out code
Feb 15, 2016
00324b9
Remove old refcheck
Feb 15, 2016
3dec14d
Remove old refcheck
Feb 15, 2016
09a2299
Report rewrite, fewer lines and more columns
Feb 16, 2016
ae0386b
Change report format. Again.
Feb 17, 2016
08dbfa5
Add missing comma
Feb 17, 2016
458e927
Fix missing columns with a . and always assert correct number of columns
Feb 17, 2016
284f2dc
Rename cluster_rep column
Feb 17, 2016
d698d21
fix line2dict to work with new report
Feb 17, 2016
3483d0e
Make line2dict a classmethod
Feb 17, 2016
9ff620a
fix _load_file for new report.tsv format
Feb 17, 2016
1d720fd
Store as dict of dicts
Feb 17, 2016
cef060d
refactor _pc_id_of_longest
Feb 17, 2016
2a5b818
refactor _to_summary_number -> _to_summary_number_for_seq
Feb 17, 2016
703c79e
Add _to_summary_number_for_variant
Feb 17, 2016
f40c34e
fix _gather_output_rows for new report.tsv format
Feb 17, 2016
f33218c
Make filter_output_rows a classmethod
Feb 17, 2016
2a745f2
make _write_tsv a classmethod
Feb 17, 2016
ef6b72f
Convert dict to string for printing
Feb 17, 2016
39cb54b
Should be a string
Feb 17, 2016
3da951c
Make classmethod
Feb 17, 2016
ecce0cf
Make classmethod
Feb 17, 2016
6fcb36b
Make classmethod
Feb 17, 2016
0035f96
Use classmethods
Feb 17, 2016
cad485a
Update run() to work with previous commits
Feb 17, 2016
93fdaba
Update with new column names
Feb 17, 2016
0124d36
Reinstate unit test
Feb 17, 2016
d3aab40
Add ";var" into column names that are variants
Feb 18, 2016
1af7813
Rename JS candy to phandango
Feb 18, 2016
4f6d802
tell phandango to colour columns consistently
Feb 18, 2016
4af261c
Add N in required column
Feb 18, 2016
e0d462d
use hit_both_strands flag
Feb 18, 2016
32c4946
Use has_nonsynonymous_variants flag
Feb 18, 2016
b06a360
Bug fix when file not exists
Feb 18, 2016
995edea
Filter small contigs
Feb 18, 2016
cfbf573
Bux fig getting has_nonynonymous_variants flag
Feb 18, 2016
0510055
Discard read pairs where both reads unmapped
Feb 18, 2016
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion ariba/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
__all__ = [
'assembly',
'assembly_compare',
'assembly_variants',
'bam_parse',
'best_seq_chooser',
'cdhit',
'cluster',
'clusters',
Expand All @@ -10,8 +14,12 @@
'histogram',
'link',
'mapping',
'refcheck',
'reference_data',
'ref_genes_getter',
'report',
'scaffold_graph',
'sequence_metadata',
'sequence_variant',
'summary',
'tasks',
]
Expand Down
313 changes: 313 additions & 0 deletions ariba/assembly.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,313 @@
import os
import shutil
import pyfastaq
import pymummer
from ariba import common, mapping, bam_parse, external_progs

class Error (Exception): pass

class Assembly:
def __init__(self,
reads1,
reads2,
ref_fasta,
working_dir,
final_assembly_fa,
final_assembly_bam,
log_fh,
scaff_name_prefix='scaffold',
kmer=0,
assembler='spades',
bowtie2_preset='very-sensitive-local',
max_insert=1000,
min_scaff_depth=10,
min_scaff_length=50,
nucmer_min_id=90,
nucmer_min_len=50,
nucmer_breaklen=50,
spades_other_options=None,
sspace_k=20,
sspace_sd=0.4,
reads_insert=500,
extern_progs=None,
):
self.reads1 = os.path.abspath(reads1)
self.reads2 = os.path.abspath(reads2)
self.ref_fasta = os.path.abspath(ref_fasta)
self.working_dir = os.path.abspath(working_dir)
self.final_assembly_fa = os.path.abspath(final_assembly_fa)
self.final_assembly_bam = os.path.abspath(final_assembly_bam)
self.log_fh = log_fh
self.scaff_name_prefix = scaff_name_prefix

self.assembly_kmer = self._get_assembly_kmer(kmer, reads1, reads2)
self.assembler = assembler
self.bowtie2_preset = bowtie2_preset
self.max_insert = max_insert
self.min_scaff_depth = min_scaff_depth
self.min_scaff_length = min_scaff_length
self.nucmer_min_id = nucmer_min_id
self.nucmer_min_len = nucmer_min_len
self.nucmer_breaklen = nucmer_breaklen
self.spades_other_options = spades_other_options
self.sspace_k = sspace_k
self.sspace_sd = sspace_sd
self.reads_insert = reads_insert

if extern_progs is None:
self.extern_progs = external_progs.ExternalProgs()
else:
self.extern_progs = extern_progs

try:
os.mkdir(self.working_dir)
except:
raise Error('Error mkdir ' + self.working_dir)

self.assembler_dir = os.path.join(self.working_dir, 'Assemble')
self.assembly_contigs = os.path.join(self.working_dir, 'contigs.fa')
self.scaffold_dir = os.path.join(self.working_dir, 'Scaffold')
self.scaffolder_scaffolds = os.path.join(self.working_dir, 'scaffolds.fa')
self.gapfill_dir = os.path.join(self.working_dir, 'Gapfill')
self.gapfilled_scaffolds = os.path.join(self.working_dir, 'scaffolds.gapfilled.fa')
self.gapfilled_length_filtered = os.path.join(self.working_dir, 'scaffolds.gapfilled.length_filtered.fa')


@staticmethod
def _get_assembly_kmer(k, reads1, reads2):
'''If the kmer not given, uses 2/3 of the mean read length (using first 1000 forward and first 1000 reverse reads)'''
if k == 0:
read_length1 = pyfastaq.tasks.mean_length(reads1, limit=1000)
read_length2 = pyfastaq.tasks.mean_length(reads2, limit=1000)
assembly_kmer = round( (read_length1 + read_length2) / 3)
if assembly_kmer % 2 == 0:
assembly_kmer += 1
else:
assembly_kmer = k

return assembly_kmer


def _assemble_with_spades(self, unittest=False):
cmd = ' '.join([
self.extern_progs.exe('spades'),
'-1', self.reads1,
'-2', self.reads2,
'-o', self.assembler_dir,
'-k', str(self.assembly_kmer),
'--untrusted-contigs', self.ref_fasta,
])
if self.spades_other_options is not None:
cmd += ' ' + self.spades_other_options

cwd = os.getcwd()
try:
os.chdir(self.working_dir)
except:
raise Error('Error chdir ' + self.working_dir)
spades_contigs = os.path.join(os.path.split(self.assembler_dir)[1], 'scaffolds.fasta')

if unittest:
os.mkdir(self.assembler_dir)
open(spades_contigs, 'w').close()
self.assembled_ok = True
else:
self.assembled_ok, err = common.syscall(cmd, verbose=True, allow_fail=True, verbose_filehandle=self.log_fh, print_errors=False)
if self.assembled_ok:
os.symlink(spades_contigs, os.path.basename(self.assembly_contigs))
else:
spades_errors_file = os.path.join(self.working_dir, 'spades_errors')
with open(spades_errors_file, 'w') as f:
print(err, file=f)
f.close()

os.chdir(cwd)


def _scaffold_with_sspace(self):
if not os.path.exists(self.assembly_contigs):
raise Error('Cannot scaffold because contigs file not found: ' + self.assembly_contigs)

try:
os.mkdir(self.scaffold_dir)
except:
raise Error('Error mkdir '+ self.scaffold_dir)

cwd = os.getcwd()

if self.extern_progs.exe('sspace') is None:
os.chdir(self.assembly_dir)
os.symlink(os.path.basename(self.assembly_contigs), os.path.basename(self.scaffolder_scaffolds))
os.chdir(cwd)
return

os.chdir(self.scaffold_dir)
lib_file = 'lib'
with open(lib_file, 'w') as f:
print('LIB', self.reads1, self.reads2, int(self.reads_insert), self.sspace_sd, 'FR', file=f)

cmd = ' '.join([
'perl', self.extern_progs.exe('sspace'),
'-k', str(self.sspace_k),
'-l', lib_file,
'-s', self.assembly_contigs
])

sspace_scaffolds = os.path.abspath('standard_output.final.scaffolds.fasta')
common.syscall(cmd, verbose=True, verbose_filehandle=self.log_fh)
os.chdir(self.working_dir)
os.symlink(os.path.relpath(sspace_scaffolds), os.path.basename(self.scaffolder_scaffolds))
os.chdir(cwd)


@staticmethod
def _has_gaps_to_fill(filename):
seq_reader = pyfastaq.sequences.file_reader(filename)
for seq in seq_reader:
if 'n' in seq.seq or 'N' in seq.seq:
return True
return False


@staticmethod
def _rename_scaffolds(infile, outfile, prefix):
freader = pyfastaq.sequences.file_reader(infile)
f_out = pyfastaq.utils.open_file_write(outfile)
i = 1
for scaff in freader:
scaff.id = prefix + '.scaffold.' + str(i)
i += 1
print(scaff, file=f_out)
pyfastaq.utils.close(f_out)


def _gap_fill_with_gapfiller(self):
if not os.path.exists(self.scaffolder_scaffolds):
raise Error('Cannot gap fill because scaffolds file not found: ' + self.scaffolder_scaffolds)

cwd = os.getcwd()

if self.extern_progs.exe('gapfiller') is None or not self._has_gaps_to_fill(self.scaffolder_scaffolds):
self._rename_scaffolds(self.scaffolder_scaffolds, self.gapfilled_scaffolds, self.scaff_name_prefix)
return

try:
os.mkdir(self.gapfill_dir)
except:
raise Error('Error mkdir '+ self.gapfill_dir)

os.chdir(self.gapfill_dir)
lib_file = 'lib'
with open(lib_file, 'w') as f:
print('LIB', 'bwa', self.reads1, self.reads2, self.reads_insert, self.sspace_sd, 'FR', file=f)

cmd = ' '.join([
'perl', self.extern_progs.exe('gapfiller'),
'-l', lib_file,
'-s', self.scaffolder_scaffolds
])

gapfilled_scaffolds = os.path.join(self.gapfill_dir, 'standard_output', 'standard_output.gapfilled.final.fa')
common.syscall(cmd, verbose=True, verbose_filehandle=self.log_fh)
self._rename_scaffolds(gapfilled_scaffolds, self.gapfilled_scaffolds, self.scaff_name_prefix)
os.chdir(cwd)


@staticmethod
def _fix_contig_orientation(contigs_fa, ref_fa, outfile, min_id=90, min_length=50, breaklen=50):
'''Changes orientation of each contig to match the reference, when possible.
Returns a set of names of contigs that had hits in both orientations to the reference'''
if not os.path.exists(contigs_fa):
raise Error('Cannot fix orientation of assembly contigs because file not found: ' + contigs_fa)

tmp_coords = os.path.join(outfile + '.tmp.rename.coords')
pymummer.nucmer.Runner(
ref_fa,
contigs_fa,
tmp_coords,
min_id=min_id,
min_length=min_length,
breaklen=breaklen,
).run()

to_revcomp = set()
not_revcomp = set()
file_reader = pymummer.coords_file.reader(tmp_coords)
for hit in file_reader:
if hit.on_same_strand():
not_revcomp.add(hit.qry_name)
else:
to_revcomp.add(hit.qry_name)

os.unlink(tmp_coords)
in_both = to_revcomp.intersection(not_revcomp)

f = pyfastaq.utils.open_file_write(outfile)
seq_reader = pyfastaq.sequences.file_reader(contigs_fa)
for seq in seq_reader:
if seq.id in to_revcomp and seq.id not in in_both:
seq.revcomp()
print(seq, file=f)
pyfastaq.utils.close(f)

return in_both


@staticmethod
def _parse_bam(sequences, bam, min_scaff_depth, max_insert):
if not os.path.exists(bam):
raise Error('File not found: ' + bam)

bam_parser = bam_parse.Parser(bam, sequences)
bam_parser.parse()
bam_parser.write_files(bam)
return bam_parser.scaff_graph_is_consistent(min_scaff_depth, max_insert)


def run(self):
self._assemble_with_spades()
self.sequences = {}

# double-check we got some contigs
number_of_contigs = pyfastaq.tasks.count_sequences(self.assembly_contigs) if os.path.exists(self.assembly_contigs) else 0
if number_of_contigs == 0:
self.assembled_ok = False
# This is to make this object picklable, to keep multithreading happy
self.log_fh = None
return
else:
self.assembled_ok = True

if self.assembled_ok:
self._scaffold_with_sspace()
self._gap_fill_with_gapfiller()

pyfastaq.tasks.filter(self.gapfilled_scaffolds, self.gapfilled_length_filtered, minlength=self.min_scaff_length)
if pyfastaq.tasks.count_sequences(self.gapfilled_length_filtered) == 0:
self.assembled_ok = False
# This is to make this object picklable, to keep multithreading happy
self.log_fh = None
return

contigs_both_strands = self._fix_contig_orientation(self.gapfilled_length_filtered, self.ref_fasta, self.final_assembly_fa)
self.has_contigs_on_both_strands = len(contigs_both_strands) > 0
pyfastaq.tasks.file_to_dict(self.final_assembly_fa, self.sequences)

mapping.run_bowtie2(
self.reads1,
self.reads2,
self.final_assembly_fa,
self.final_assembly_bam[:-4],
threads=1,
sort=True,
samtools=self.extern_progs.exe('samtools'),
bowtie2=self.extern_progs.exe('bowtie2'),
bowtie2_preset=self.bowtie2_preset,
verbose=True,
verbose_filehandle=self.log_fh
)

self.scaff_graph_ok = self._parse_bam(self.sequences, self.final_assembly_bam, self.min_scaff_depth, self.max_insert)

# This is to make this object picklable, to keep multithreading happy
self.log_fh = None
Loading