-
Notifications
You must be signed in to change notification settings - Fork 22
User defined rules #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add CodonComplexity module for sequence complexity checks
Update CodonPrediction with complexity checking functionality and updated docstrings
@dansuissa Can you write and run tests, also add explanation to README |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dansuissa I've found bugs in using your find_repeats
method, and propose an alternative that also greatly improves speed.
1). find_repeats
is not providing a set()
of the repeats (i.e. there are duplicates in the output)
2) Your line of repeat_percentage = sum(len(r[0]) * len(r[1]) for r in repeats) / len(sequence) * 100
would provide percent sequence cover much greater than 100%.
Below is a proposed method to ensure unique repeats and improve speed by first precomputing the neighbor dH into an array and using a fast cumulative‑sum lookup.
import numpy as np
NN_PARAMS = {
'AA': -7.9, 'TT': -7.9,
'AT': -7.2, 'TA': -7.2,
'CA': -8.5, 'TG': -8.5,
'GT': -8.4, 'AC': -8.4,
'CT': -7.8, 'AG': -7.8,
'GA': -8.2, 'TC': -8.2,
'CG': -10.6, 'GC': -9.8,
'GG': -8.0, 'CC': -8.0
}
def find_repeats_v2(
sequence: str,
min_length: int = 20,
min_tm: float = 60.0,
max_length: int = 50
) -> List[Tuple[str, List[int]]]:
seq = sequence.upper()
n = len(seq)
if n < min_length:
return []
# Precompute neighbor ΔH cumsum
dh_arr = np.array([NN_PARAMS.get(seq[i:i+2], 0.0)
for i in range(n-1)], dtype=float)
cs = np.concatenate(([0.0], np.cumsum(dh_arr)))
repeats = {}
for L in range(min_length, min(max_length, n)+1):
denom = L * -10.0 + 108.0
for i in range(n - L + 1):
# sum ΔH + initiation
window_dh = (cs[i+L-1] - cs[i]) + 0.1*L
tm = (window_dh * 1000.0) / denom
if tm < min_tm:
continue
pat = seq[i:i+L]
# find all start positions
pos, start = [], 0
while True:
idx = seq.find(pat, start)
if idx < 0: break
pos.append(idx)
start = idx+1
if len(pos) > 1:
repeats[pat] = pos
return list(repeats.items())
def repeat_coverage_pct_by_set(repeats, seq_len):
"""
repeats : List[Tuple[str, List[int]]]
seq_len : int
returns percent of bases covered by at least one repeat
"""
covered = set()
for pat, poses in repeats:
L = len(pat)
for p in poses:
covered.update(range(p, p + L))
return len(covered) / seq_len * 100.0
For a given sequence of:
SEQ = "CGCGCGCGCGCGCGCGCGCGCGCGCCCCTATGGGCATTGCGCT"
# Current methods
repeats = find_repeats(sequence=SEQ, min_length=20, min_tm=60.0) # Method in PR
repeat_percentage = (sum(len(r[0]) * len(r[1]) for r in repeats) / len(SEQ)) * 100 # line from PR
[find_repeats] 0.011506 s
[('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2])]
Repeat_percentage = 2095.3488372093025%
# New methods
repeats_v2 = find_repeats_v2(sequence=SEQ, min_length=20, min_tm=60.0)
repeat_percentage = repeat_coverage_pct_by_set(repeats_v2, len(SEQ))
[find_repeats_v2] 0.000414 s
[('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2])]
Repeat_percentage = 58.139534883720934%
# Approximate Tm calculation | ||
return round((dH * 1000) / (len(sequence) * -10 + 108), 1) | ||
|
||
def find_repeats(sequence: str, min_length: int = 20, min_tm: float = 60.0) -> List[Tuple[str, List[int]]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
find_repeats
is not providing a set()
of the repeats (i.e. there are duplicates in the output). This method is also slow due to .get
for each nt pair. It would be faster to precompute the neighbor dH into an array and using a fast cumulative‑sum lookup. The below suggestion is much faster (~20X), provides a unique set of repeats, and also calculates tm within find_repeats
.
NN_PARAMS = {
'AA': -7.9, 'TT': -7.9,
'AT': -7.2, 'TA': -7.2,
'CA': -8.5, 'TG': -8.5,
'GT': -8.4, 'AC': -8.4,
'CT': -7.8, 'AG': -7.8,
'GA': -8.2, 'TC': -8.2,
'CG': -10.6, 'GC': -9.8,
'GG': -8.0, 'CC': -8.0
}
def find_repeats(
sequence: str,
min_length: int = 20,
min_tm: float = 60.0,
max_length: int = 50
) -> List[Tuple[str, List[int]]]:
seq = sequence.upper()
n = len(seq)
if n < min_length:
return []
# Precompute neighbor ΔH cumsum
dh_arr = np.array([NN_PARAMS.get(seq[i:i+2], 0.0)
for i in range(n-1)], dtype=float)
cs = np.concatenate(([0.0], np.cumsum(dh_arr)))
repeats = {}
for L in range(min_length, min(max_length, n)+1):
denom = L * -10.0 + 108.0
for i in range(n - L + 1):
# sum ΔH + initiation
window_dh = (cs[i+L-1] - cs[i]) + 0.1*L
tm = (window_dh * 1000.0) / denom
if tm < min_tm:
continue
pat = seq[i:i+L]
# find all start positions
pos, start = [], 0
while True:
idx = seq.find(pat, start)
if idx < 0: break
pos.append(idx)
start = idx+1
if len(pos) > 1:
repeats[pat] = pos
return list(repeats.items())
if "repeat_sequences" in config.enabled_rules: | ||
repeats = find_repeats(sequence, config.max_repeat_length, config.min_repeat_tm) | ||
if repeats: | ||
repeat_percentage = sum(len(r[0]) * len(r[1]) for r in repeats) / len(sequence) * 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems to provide % coverage > 100% (particularly since find_repeats
does not return a unique set of repeats.
Suggestion to add function for proper sequence coverage, such as:
def repeat_coverage_pct_by_set(repeats, seq_len):
"""
repeats : List[Tuple[str, List[int]]]
seq_len : int
returns percent of bases covered by at least one repeat
"""
covered = set()
for pat, poses in repeats:
L = len(pat)
for p in poses:
covered.update(range(p, p + L))
return len(covered) / seq_len * 100.0
Fixes #7: Add user-defined rules for DNA sequence complexity
Note:
Would you like me to add any additional test cases or documentation?