User defined rules #21

Open · wants to merge 2 commits into main
Conversation

dansuissa (Collaborator)

Fixes #7: Add user-defined rules for DNA sequence complexity

  • Added new CodonComplexity module with configurable rules
  • Implemented Twist's synthesis requirements:
    • Repeat sequence detection (≥20bp or Tm ≥60°C)
    • GC content checks (25-65% globally, max 52% deviation)
    • Homopolymer identification
    • HIS tag pattern validation
  • Added sequence regeneration capability when complexity checks fail
  • Updated prediction function to support complexity checking

Note:
Would you like me to add any additional test cases or documentation?
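The GC-content rule listed above can be sketched as a standalone check; the function and parameter names here are illustrative assumptions, not the PR's actual API:

```python
# Illustrative sketch of the global GC-content rule (25-65%) described in the
# PR summary; names are assumptions, not the module's real interface.
def gc_content(seq: str) -> float:
    """Percent G+C in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) * 100.0

def passes_gc_rule(seq: str, lo: float = 25.0, hi: float = 65.0) -> bool:
    """True if global GC content falls within the configured window."""
    return lo <= gc_content(seq) <= hi
```

A per-window variant of the same check would cover the deviation limit as well.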

Add CodonComplexity module for sequence complexity checks
Update CodonPrediction with complexity checking functionality and updated docstrings
Adibvafa (Owner) commented Jul 1, 2025

@dansuissa Can you write and run tests, and also add an explanation to the README?
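One candidate for the requested test suite is a property check that repeat coverage can never exceed 100%. The coverage helper proposed later in this thread is inlined here so the sketch is self-contained; module paths in the real PR will differ.

```python
# Hedged sketch of one requested test: coverage computed over a set of
# covered bases must stay within [0, 100]. The helper mirrors the
# repeat_coverage_pct_by_set function proposed later in this review thread.
def repeat_coverage_pct_by_set(repeats, seq_len):
    covered = set()
    for pat, poses in repeats:
        for p in poses:
            covered.update(range(p, p + len(pat)))
    return len(covered) / seq_len * 100.0

def test_coverage_never_exceeds_100():
    seq_len = 43
    # heavily overlapping repeats: the worst case for the naive percentage formula
    repeats = [
        ("CGCGCGCGCGCGCGCGCGCG", [0, 2, 4]),
        ("GCGCGCGCGCGCGCGCGCGC", [1, 3, 5]),
    ]
    pct = repeat_coverage_pct_by_set(repeats, seq_len)
    assert 0.0 <= pct <= 100.0
```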

PlantDr430 left a comment

@dansuissa I've found bugs in your find_repeats method and propose an alternative that also greatly improves speed.

1) find_repeats does not return a unique set of repeats (i.e. there are duplicates in the output).
2) Your line repeat_percentage = sum(len(r[0]) * len(r[1]) for r in repeats) / len(sequence) * 100 can report a percent sequence coverage much greater than 100%.
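A minimal illustration of point 2 (the values here are constructed for the example): the formula multiplies pattern length by occurrence count, so overlapping occurrences are counted repeatedly instead of being unioned.

```python
# One 20 bp pattern with 3 overlapping starts already overcounts:
# 20 * 3 = 60 "covered" bases over a 43 bp sequence.
repeats = [("CGCGCGCGCGCGCGCGCGCG", [0, 2, 4])]
seq_len = 43
naive_pct = sum(len(p) * len(pos) for p, pos in repeats) / seq_len * 100
# naive_pct is about 139.5%, already > 100% from a single pattern
```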

Below is a proposed method that ensures unique repeats and improves speed by precomputing the nearest-neighbor ΔH values into an array and using a fast cumulative-sum lookup.

import numpy as np
from typing import List, Tuple

NN_PARAMS = {
    'AA': -7.9, 'TT': -7.9,
    'AT': -7.2, 'TA': -7.2,
    'CA': -8.5, 'TG': -8.5,
    'GT': -8.4, 'AC': -8.4,
    'CT': -7.8, 'AG': -7.8,
    'GA': -8.2, 'TC': -8.2,
    'CG': -10.6, 'GC': -9.8,
    'GG': -8.0, 'CC': -8.0
}

def find_repeats_v2(
    sequence: str,
    min_length: int = 20,
    min_tm: float = 60.0,
    max_length: int = 50
) -> List[Tuple[str, List[int]]]:
    seq = sequence.upper()
    n = len(seq)
    if n < min_length:
        return []

    # Precompute neighbor ΔH cumsum
    dh_arr = np.array([NN_PARAMS.get(seq[i:i+2], 0.0) 
                       for i in range(n-1)], dtype=float)
    cs = np.concatenate(([0.0], np.cumsum(dh_arr)))

    repeats = {}
    for L in range(min_length, min(max_length, n)+1):
        denom = L * -10.0 + 108.0

        for i in range(n - L + 1):
            # sum ΔH + initiation
            window_dh = (cs[i+L-1] - cs[i]) + 0.1*L
            tm = (window_dh * 1000.0) / denom
            if tm < min_tm:
                continue

            pat = seq[i:i+L]
            # find all start positions
            pos, start = [], 0
            while True:
                idx = seq.find(pat, start)
                if idx < 0: break
                pos.append(idx)
                start = idx+1

            if len(pos) > 1:
                repeats[pat] = pos

    return list(repeats.items())

def repeat_coverage_pct_by_set(repeats, seq_len):
    """
    repeats : List[Tuple[str, List[int]]]  
    seq_len : int
    returns percent of bases covered by at least one repeat
    """
    covered = set()
    for pat, poses in repeats:
        L = len(pat)
        for p in poses:
            covered.update(range(p, p + L))
    return len(covered) / seq_len * 100.0

For a given sequence of:

SEQ = "CGCGCGCGCGCGCGCGCGCGCGCGCCCCTATGGGCATTGCGCT"

# Current methods
repeats = find_repeats(sequence=SEQ, min_length=20, min_tm=60.0) # Method in PR
repeat_percentage = (sum(len(r[0]) * len(r[1]) for r in repeats) / len(SEQ)) * 100 # line from PR
[find_repeats] 0.011506 s
[('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2])]
Repeat_percentage = 2095.3488372093025%

# New methods
repeats_v2 = find_repeats_v2(sequence=SEQ, min_length=20, min_tm=60.0)
repeat_percentage = repeat_coverage_pct_by_set(repeats_v2, len(SEQ))

[find_repeats_v2] 0.000414 s
[('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2])]
Repeat_percentage = 58.139534883720934%

# Approximate Tm calculation
return round((dH * 1000) / (len(sequence) * -10 + 108), 1)

def find_repeats(sequence: str, min_length: int = 20, min_tm: float = 60.0) -> List[Tuple[str, List[int]]]:


find_repeats does not return a unique set of repeats (i.e. there are duplicates in the output). The method is also slow because it calls .get for each nucleotide pair. It is faster to precompute the neighbor ΔH values into an array and use a fast cumulative-sum lookup. The suggestion below is roughly 20x faster, returns a unique set of repeats, and also calculates Tm within find_repeats.

import numpy as np
from typing import List, Tuple

NN_PARAMS = {
    'AA': -7.9, 'TT': -7.9,
    'AT': -7.2, 'TA': -7.2,
    'CA': -8.5, 'TG': -8.5,
    'GT': -8.4, 'AC': -8.4,
    'CT': -7.8, 'AG': -7.8,
    'GA': -8.2, 'TC': -8.2,
    'CG': -10.6, 'GC': -9.8,
    'GG': -8.0, 'CC': -8.0
}

def find_repeats(
    sequence: str,
    min_length: int = 20,
    min_tm: float = 60.0,
    max_length: int = 50
) -> List[Tuple[str, List[int]]]:
    seq = sequence.upper()
    n = len(seq)
    if n < min_length:
        return []

    # Precompute neighbor ΔH cumsum
    dh_arr = np.array([NN_PARAMS.get(seq[i:i+2], 0.0) 
                       for i in range(n-1)], dtype=float)
    cs = np.concatenate(([0.0], np.cumsum(dh_arr)))

    repeats = {}
    for L in range(min_length, min(max_length, n)+1):
        denom = L * -10.0 + 108.0

        for i in range(n - L + 1):
            # sum ΔH + initiation
            window_dh = (cs[i+L-1] - cs[i]) + 0.1*L
            tm = (window_dh * 1000.0) / denom
            if tm < min_tm:
                continue

            pat = seq[i:i+L]
            # find all start positions
            pos, start = [], 0
            while True:
                idx = seq.find(pat, start)
                if idx < 0: break
                pos.append(idx)
                start = idx+1

            if len(pos) > 1:
                repeats[pat] = pos

    return list(repeats.items())

if "repeat_sequences" in config.enabled_rules:
    repeats = find_repeats(sequence, config.max_repeat_length, config.min_repeat_tm)
    if repeats:
        repeat_percentage = sum(len(r[0]) * len(r[1]) for r in repeats) / len(sequence) * 100


This method can report % coverage greater than 100%, particularly since find_repeats does not return a unique set of repeats.

Suggestion: add a function that computes proper sequence coverage, such as:

def repeat_coverage_pct_by_set(repeats, seq_len):
    """
    repeats : List[Tuple[str, List[int]]]  
    seq_len : int
    returns percent of bases covered by at least one repeat
    """
    covered = set()
    for pat, poses in repeats:
        L = len(pat)
        for p in poses:
            covered.update(range(p, p + L))
    return len(covered) / seq_len * 100.0

Successfully merging this pull request may close these issues.

Allow user-defined rules
3 participants