Trias: an encoder-decoder model for generating synthetic eukaryotic mRNA sequences

Trias is an encoder-decoder language model trained to reverse-translate protein sequences into codon sequences. It learns codon usage patterns from 10 million mRNA coding sequences across 640 vertebrate species, enabling context-aware sequence generation without requiring handcrafted rules.
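To give a rough sense of why reverse translation is a modeling problem rather than a lookup, the sketch below counts how many codon sequences encode a given protein under the standard genetic code. It is an illustration only and not part of the Trias codebase; the function name count_synonymous_sequences is made up for this example.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code.
# "*" stands for the three stop codons (TAA, TAG, TGA).
DEGENERACY = {
    "M": 1, "W": 1,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "I": 3, "*": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}


def count_synonymous_sequences(protein: str) -> int:
    """Return how many distinct codon sequences translate to `protein`.

    Hypothetical helper for illustration; the count is the product of the
    per-residue codon degeneracies.
    """
    return prod(DEGENERACY[aa] for aa in protein)


if __name__ == "__main__":
    # Example protein taken from the reverse-translation commands in this README.
    example = "MTEITAAMVKELRESTGAGMMDCKNALSETQ*"
    print(count_synonymous_sequences(example))
```

Even a short protein admits a very large number of synonymous encodings, which is why a model that learns codon usage from data can be useful.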

Setup and installation

Trias is developed and tested with Python 3.8.8 and uses Weights & Biases for logging training progress.

We recommend using conda:

conda create -n trias python=3.8.8
conda activate trias

Install dependencies:

git clone https://github.com/lareaulab/Trias.git
cd Trias
pip install -e .

Or install from requirements.txt:

pip install -r requirements.txt

Reverse Translation

Trias generates optimized codon sequences from protein input using a pretrained model. You can use the checkpoint hosted on Hugging Face (lareaulab/Trias) or a local model directory. Trias runs on both CPU and GPU, and supports both greedy decoding and beam search for flexible output control.

Greedy decoding selects the most likely token at each step; it is faster and deterministic. Beam search explores multiple candidate paths and is better suited to longer or more complex proteins, but it is slower.

Greedy search

python scripts/reverse_translation.py \
  --model_path lareaulab/Trias \
  --protein_sequence "MTEITAAMVKELRESTGAGMMDCKNALSETQ*" \
  --species "Homo sapiens" \
  --decoding greedy

Beam search

python scripts/reverse_translation.py \
  --model_path lareaulab/Trias \
  --protein_sequence "MTEITAAMVKELRESTGAGMMDCKNALSETQ*" \
  --species "Homo sapiens" \
  --decoding beam \
  --beam_width 5
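The difference between the two decoding strategies can be sketched on a toy next-token model. This is not how scripts/reverse_translation.py is implemented; TOY_MODEL, toy_step, greedy_decode, and beam_decode are hypothetical names, and the probabilities are invented so that the greedy choice is not the best overall sequence.

```python
import math
from typing import Callable, Dict, Tuple

Sequence = Tuple[str, ...]
StepFn = Callable[[Sequence], Dict[str, float]]

# Invented next-token probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"A": 0.55, "B": 0.45},
    ("B",): {"A": 0.9, "B": 0.1},
}


def toy_step(prefix: Sequence) -> Dict[str, float]:
    """Return log-probabilities of each next token given `prefix`."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix].items()}


def greedy_decode(step: StepFn, length: int) -> Tuple[Sequence, float]:
    """Pick the single most likely token at every step."""
    seq: Sequence = ()
    score = 0.0
    for _ in range(length):
        log_probs = step(seq)
        token = max(log_probs, key=log_probs.get)
        seq += (token,)
        score += log_probs[token]
    return seq, score


def beam_decode(step: StepFn, length: int, beam_width: int) -> Tuple[Sequence, float]:
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in step(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


if __name__ == "__main__":
    print(greedy_decode(toy_step, 2))
    print(beam_decode(toy_step, 2, beam_width=2))
```

In this toy setting, greedy decoding commits to "A" first and cannot recover, while a beam of width 2 keeps "B" alive long enough to find the higher-probability sequence. A beam of width 1 reduces to greedy decoding.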

Dataset format

To train Trias, your dataset must include the following columns:

protein: Amino acid sequence, must end with * (stop codon)
species_name: Label identifying the species (e.g., "Homo sapiens")
mrna: Full mRNA sequence
codon_start: 0-based index of the first nucleotide of the coding region in the mrna
codon_end: 0-based index of the last nucleotide of the stop codon

Supported file formats:

.parquet, .csv, .json
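The column definitions above imply a few consistency checks that can be run on each row before training. The following is a minimal sketch, not a validator shipped with Trias: check_record is a hypothetical helper, and it assumes that the coding region spans codon_start through codon_end inclusive, that its length is three nucleotides per protein character (counting the trailing *), and that the standard stop codons are used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_record(record: dict) -> list:
    """Return a list of problems found in one dataset row (empty if none).

    Hypothetical helper; the checks encode assumptions derived from the
    column descriptions, not rules confirmed by the training code.
    """
    problems = []
    protein = record["protein"]
    # Accept either RNA (U) or DNA (T) alphabets by normalising to T.
    mrna = record["mrna"].upper().replace("U", "T")
    start, end = record["codon_start"], record["codon_end"]

    if not protein.endswith("*"):
        problems.append("protein does not end with '*'")
    if not 0 <= start <= end < len(mrna):
        problems.append("codon_start/codon_end fall outside the mrna")
        return problems

    # codon_end is the index of the LAST stop-codon nucleotide, so the
    # Python slice must extend one position past it.
    cds = mrna[start:end + 1]
    if len(cds) != 3 * len(protein):
        problems.append("coding region length is not 3 x protein length")
    if cds[-3:] not in STOP_CODONS:
        problems.append("coding region does not end with a stop codon")
    return problems


if __name__ == "__main__":
    # Made-up row: 2 nt of 5' UTR, ATG GCC TAA, 2 nt of 3' UTR.
    row = {
        "protein": "MA*",
        "species_name": "Homo sapiens",
        "mrna": "GGATGGCCTAACC",
        "codon_start": 2,
        "codon_end": 10,
    }
    print(check_record(row))
```

A common mistake this catches is supplying codon_end as an exclusive end index (one past the stop codon), which shifts the coding region length off a multiple of three.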

Model training

Use the provided training script to launch a run:

bash scripts/train_trias.sh

This launches a full training session using main.py. You can customize:

Model architecture (hidden size, number of layers, attention heads, etc.)
Training parameters (steps, batch size, learning rate, etc.)

Reproducing figures

All figure generation code is available in the notebook:

notebooks/trias_figures.ipynb

To reproduce the figures from the paper, please ensure you download the following datasets and place them in the appropriate directory (see comments in the notebook for expected paths).

1. GTEx expression data

Visit the GTEx Portal and, under GTEx Analysis V8, download the file:

GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz
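The downloaded file is a gzip-compressed GCT table: a version line, a line with the row and column counts, then a tab-separated table whose first two columns are Name and Description. The sketch below shows one way to read such a file with the standard library. It is not taken from the notebook, which may load the data differently; parse_gct and load_gct_gz are hypothetical helpers, and the gene and tissue values in the demo are made up.

```python
import csv
import gzip
import io


def parse_gct(handle) -> dict:
    """Parse an open GCT text stream into {gene_id: {sample: value}}."""
    version = handle.readline().strip()
    if not version.startswith("#1."):
        raise ValueError(f"unexpected GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])

    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # skip the Name and Description columns

    table = {}
    for fields in reader:
        if not fields:  # tolerate blank trailing lines
            continue
        table[fields[0]] = dict(zip(samples, map(float, fields[2:])))

    if len(table) != n_rows or len(samples) != n_cols:
        raise ValueError("table size does not match the GCT dimension line")
    return table


def load_gct_gz(path: str) -> dict:
    """Open a .gct.gz file in text mode and parse it."""
    with gzip.open(path, "rt", newline="") as handle:
        return parse_gct(handle)


if __name__ == "__main__":
    # Tiny made-up example in GCT layout, used instead of the real download.
    demo = (
        "#1.2\n"
        "2\t2\n"
        "Name\tDescription\tLiver\tLung\n"
        "GENE_A\tgeneA\t1.5\t0.0\n"
        "GENE_B\tgeneB\t10.0\t3.25\n"
    )
    print(parse_gct(io.StringIO(demo)))
```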

2. GFP data from Bicknell et al. (2024)

Visit the Cell Reports article and download Table S3 under Supplemental information.

3. Additional datasets

The full dataset to reproduce the figures (~13MB zipped) is included in this repo as data.zip.

To use it in the notebook, unzip the file:

unzip data.zip

This will extract the data/ folder. Don't forget to adjust the file path in the notebook to point to the extracted data/ directory.

Citation

If you use Trias, please cite our work:

@article{faizi2025,
  title={A generative language model decodes contextual constraints on codon choice for mRNA design},
  author={Marjan Faizi and Helen Sakharova and Liana F. Lareau},
  journal={bioRxiv},
  year={2025},
  url={https://doi.org/10.1101/2025.05.13.653614}
}
