User defined rules #21

dansuissa · 2025-02-13T09:40:24Z

Fixes #7: Add user-defined rules for DNA sequence complexity

Added new CodonComplexity module with configurable rules
Implemented Twist's synthesis requirements:
- Repeat sequence detection (≥20bp or Tm ≥60°C)
- GC content checks (25-65% globally, max 52% deviation)
- Homopolymer identification
- HIS tag pattern validation
Added sequence regeneration capability when complexity checks fail
Updated prediction function to support complexity checking

Note:
Would you like me to add any additional test cases or documentation?

Add CodonComplexity module for sequence complexity checks

Update CodonPrediction with complexity checking functionality and updated docstrings

Adibvafa · 2025-07-01T22:09:52Z

@dansuissa Can you write and run tests, also add explanation to README

PlantDr430

@dansuissa I've found bugs in using your find_repeats method, and propose an alternative that also greatly improves speed.

1). find_repeats is not providing a set() of the repeats (i.e. there are duplicates in the output)
2) Your line of repeat_percentage = sum(len(r[0]) * len(r[1]) for r in repeats) / len(sequence) * 100
would provide percent sequence cover much greater than 100%.

Below is a proposed method to ensure unique repeats and improve speed by first precomputing the neighbor dH into an array and using a fast cumulative‑sum lookup.

import numpy as np

NN_PARAMS = {
    'AA': -7.9, 'TT': -7.9,
    'AT': -7.2, 'TA': -7.2,
    'CA': -8.5, 'TG': -8.5,
    'GT': -8.4, 'AC': -8.4,
    'CT': -7.8, 'AG': -7.8,
    'GA': -8.2, 'TC': -8.2,
    'CG': -10.6, 'GC': -9.8,
    'GG': -8.0, 'CC': -8.0
}

def find_repeats_v2(
    sequence: str,
    min_length: int = 20,
    min_tm: float = 60.0,
    max_length: int = 50
) -> List[Tuple[str, List[int]]]:
    seq = sequence.upper()
    n = len(seq)
    if n < min_length:
        return []

    # Precompute neighbor ΔH cumsum
    dh_arr = np.array([NN_PARAMS.get(seq[i:i+2], 0.0) 
                       for i in range(n-1)], dtype=float)
    cs = np.concatenate(([0.0], np.cumsum(dh_arr)))

    repeats = {}
    for L in range(min_length, min(max_length, n)+1):
        denom = L * -10.0 + 108.0

        for i in range(n - L + 1):
            # sum ΔH + initiation
            window_dh = (cs[i+L-1] - cs[i]) + 0.1*L
            tm = (window_dh * 1000.0) / denom
            if tm < min_tm:
                continue

            pat = seq[i:i+L]
            # find all start positions
            pos, start = [], 0
            while True:
                idx = seq.find(pat, start)
                if idx < 0: break
                pos.append(idx)
                start = idx+1

            if len(pos) > 1:
                repeats[pat] = pos

    return list(repeats.items())

def repeat_coverage_pct_by_set(repeats, seq_len):
    """
    repeats : List[Tuple[str, List[int]]]  
    seq_len : int
    returns percent of bases covered by at least one repeat
    """
    covered = set()
    for pat, poses in repeats:
        L = len(pat)
        for p in poses:
            covered.update(range(p, p + L))
    return len(covered) / seq_len * 100.0

For a given sequence of:

SEQ = "CGCGCGCGCGCGCGCGCGCGCGCGCCCCTATGGGCATTGCGCT"

# Current methods
repeats = find_repeats(sequence=SEQ, min_length=20, min_tm=60.0) # Method in PR
repeat_percentage = (sum(len(r[0]) * len(r[1]) for r in repeats) / len(SEQ)) * 100 # line from PR
[find_repeats] 0.011506 s
[('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2])]
Repeat_percentage = 2095.3488372093025%

# New methods
repeats_v2 = find_repeats_v2(sequence=SEQ, min_length=20, min_tm=60.0)
repeat_percentage = repeat_coverage_pct_by_set(repeats_v2, len(SEQ))

[find_repeats_v2] 0.000414 s
[('CGCGCGCGCGCGCGCGCGCG', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGC', [1, 3, 5]), ('CGCGCGCGCGCGCGCGCGCGC', [0, 2, 4]), ('GCGCGCGCGCGCGCGCGCGCG', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCG', [0, 2]), ('GCGCGCGCGCGCGCGCGCGCGC', [1, 3]), ('CGCGCGCGCGCGCGCGCGCGCGC', [0, 2])]
Repeat_percentage = 58.139534883720934%

PlantDr430 · 2025-07-17T15:43:50Z

CodonTransformer/CodonComplexity.py

+    # Approximate Tm calculation
+    return round((dH * 1000) / (len(sequence) * -10 + 108), 1)
+
+def find_repeats(sequence: str, min_length: int = 20, min_tm: float = 60.0) -> List[Tuple[str, List[int]]]:


find_repeats is not providing a set() of the repeats (i.e. there are duplicates in the output). This method is also slow due to .get for each nt pair. It would be faster to precompute the neighbor dH into an array and using a fast cumulative‑sum lookup. The below suggestion is much faster (~20X), provides a unique set of repeats, and also calculates tm within find_repeats.

NN_PARAMS = { 'AA': -7.9, 'TT': -7.9, 'AT': -7.2, 'TA': -7.2, 'CA': -8.5, 'TG': -8.5, 'GT': -8.4, 'AC': -8.4, 'CT': -7.8, 'AG': -7.8, 'GA': -8.2, 'TC': -8.2, 'CG': -10.6, 'GC': -9.8, 'GG': -8.0, 'CC': -8.0 } def find_repeats( sequence: str, min_length: int = 20, min_tm: float = 60.0, max_length: int = 50 ) -> List[Tuple[str, List[int]]]: seq = sequence.upper() n = len(seq) if n < min_length: return [] # Precompute neighbor ΔH cumsum dh_arr = np.array([NN_PARAMS.get(seq[i:i+2], 0.0) for i in range(n-1)], dtype=float) cs = np.concatenate(([0.0], np.cumsum(dh_arr))) repeats = {} for L in range(min_length, min(max_length, n)+1): denom = L * -10.0 + 108.0 for i in range(n - L + 1): # sum ΔH + initiation window_dh = (cs[i+L-1] - cs[i]) + 0.1*L tm = (window_dh * 1000.0) / denom if tm < min_tm: continue pat = seq[i:i+L] # find all start positions pos, start = [], 0 while True: idx = seq.find(pat, start) if idx < 0: break pos.append(idx) start = idx+1 if len(pos) > 1: repeats[pat] = pos return list(repeats.items())

PlantDr430 · 2025-07-17T15:45:45Z

CodonTransformer/CodonComplexity.py

+    if "repeat_sequences" in config.enabled_rules:
+        repeats = find_repeats(sequence, config.max_repeat_length, config.min_repeat_tm)
+        if repeats:
+            repeat_percentage = sum(len(r[0]) * len(r[1]) for r in repeats) / len(sequence) * 100


This method seems to provide % coverage > 100% (particularly since find_repeats does not return a unique set of repeats.

Suggestion to add function for proper sequence coverage, such as:

def repeat_coverage_pct_by_set(repeats, seq_len): """ repeats : List[Tuple[str, List[int]]] seq_len : int returns percent of bases covered by at least one repeat """ covered = set() for pat, poses in repeats: L = len(pat) for p in poses: covered.update(range(p, p + L)) return len(covered) / seq_len * 100.0

dansuissa added 2 commits February 13, 2025 10:34

Create CodonComplexity.py

763c8d9

Add CodonComplexity module for sequence complexity checks

Update CodonPrediction.py

818a004

Update CodonPrediction with complexity checking functionality and updated docstrings

dansuissa requested review from Adibvafa and gui11aume February 24, 2025 15:00

PlantDr430 reviewed Jul 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

User defined rules #21

User defined rules #21

Uh oh!

dansuissa commented Feb 13, 2025

Uh oh!

Adibvafa commented Jul 1, 2025

Uh oh!

PlantDr430 left a comment •

edited

Loading

Uh oh!

PlantDr430 Jul 17, 2025

Uh oh!

PlantDr430 Jul 17, 2025

Uh oh!

Uh oh!

User defined rules #21

Are you sure you want to change the base?

User defined rules #21

Uh oh!

Conversation

dansuissa commented Feb 13, 2025

Uh oh!

Adibvafa commented Jul 1, 2025

Uh oh!

PlantDr430 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

PlantDr430 Jul 17, 2025

Choose a reason for hiding this comment

Uh oh!

PlantDr430 Jul 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

PlantDr430 left a comment •

edited

Loading