Skip to content

Commit 9643747

Browse files
committed
Add documentation for substitution models
1 parent 219cc0f commit 9643747

File tree

7 files changed

+317
-36
lines changed

7 files changed

+317
-36
lines changed

README.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,6 +56,24 @@ For detailed information on how to use `torchtree` and its features, please refe
5656
## Quick start
5757
`torchtree` requires a JSON file containing models and algorithms. A configuration file can be generated using `torchtree-cli`, a command line-based tool. This two-step process allows the user to adjust values in the configuration file, such as hyperparameters.
5858

59+
`torchtree-cli` implements several subcommands, each corresponding to a different type of inference algorithm.
60+
A list of available subcommands can be obtained by running `torchtree-cli --help`.
61+
62+
The following subcommands are available:
63+
64+
* `advi`: Automatic differentiation variational inference
65+
* `hmc`: Hamiltonian Monte Carlo
66+
* `map`: Maximum *a posteriori*
67+
* `mcmc`: Markov chain Monte Carlo
68+
69+
Each subcommand/algorithm requires a different set of arguments which can be obtained by running `torchtree-cli <subcommand> --help`.
70+
71+
`torchtree-cli` requires an alignment file in FASTA format and a tree file in either [Newick](https://en.wikipedia.org/wiki/Newick_format) or [NEXUS](https://en.wikipedia.org/wiki/Nexus_file) format.
72+
While *torchtree* uses the [DendroPy](https://jeetsukumaran.github.io/DendroPy) library to parse and manipulate phylogenetic trees, it is recommended to use a Newick file due to the numerous variations of the NEXUS format.
73+
74+
Let's explore a few examples of how to use these programs using an influenza A virus dataset containing 69 DNA sequences.
75+
The alignment and tree files are located in the [data](data) directory.
76+
5977
### 1 - Generating a configuration file
6078
Some examples of models using variational inference:
6179

docs/getting_started/quick_start.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -17,6 +17,8 @@ The following subcommands are available:
1717

1818
Each subcommand/algorithm requires a different set of arguments which can be obtained by running ``torchtree-cli <subcommand> --help``.
1919

20+
:command:`torchtree-cli` requires an alignment file in FASTA format and a tree file in either `Newick <https://en.wikipedia.org/wiki/Newick_format>`_ or `NEXUS <https://en.wikipedia.org/wiki/Nexus_file>`_ format.
21+
While *torchtree* uses the `DendroPy <https://jeetsukumaran.github.io/DendroPy/>`_ library to parse and manipulate phylogenetic trees, it is recommended to use a Newick file due to the numerous variations of the NEXUS format.
2022

2123
Let's explore a few examples of how to use these programs using an influenza A virus dataset containing 69 DNA sequences.
2224
The alignment and tree files are located in the `data <https://github.com/4ment/torchtree/tree/master/data>`_ directory of the `torchtree <https://github.com/4ment/torchtree>`_ repository.

docs/index.rst

Lines changed: 15 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ torchtree is a program designed for developing and inferring phylogenetic models
99
It is implemented in Python and uses PyTorch to leverage automatic differentiation.
1010
Inference algorithms include variational inference, Hamiltonian Monte Carlo, maximum *a posteriori* and Markov chain Monte Carlo.
1111

12-
For a comprehensive assessment of torchtree's performance and use cases, please see our evaluation repository, [torchtree-experiments](https://github.com/4ment/torchtree-experiments), where torchtree was rigorously tested on various datasets and benchmarked for accuracy and speed.
12+
For a comprehensive assessment of torchtree's performance and use cases, please see our evaluation repository, `torchtree-experiments <https://github.com/4ment/torchtree-experiments>`_, where torchtree was rigorously tested on various datasets and benchmarked for accuracy and speed.
1313

1414
Installation
1515
------------
@@ -26,25 +26,24 @@ Installation
2626
pip install torchtree
2727

2828

29-
Plug-ins
30-
------------------
29+
.. toctree::
30+
:maxdepth: 1
31+
:caption: Getting Started
3132

32-
torchtree can be easily extended without modifying the code base thanks its modular implementation. Some examples of external packages:
33+
getting_started/install.rst
34+
getting_started/quick_start.rst
35+
getting_started/features.rst
36+
getting_started/json_reference.rst
37+
getting_started/plugins.rst
3338

34-
- torchtree-bito_: is a plug-in interfacing the bito_ library for fast gradient calculations with BEAGLE_.
35-
- torchtree-physher_: is a plug-in interfacing physher_ for fast gradient calculations of tree likelihood and coalescent models.
36-
- torchtree-scipy_: is a plug-in interfacing the SciPy package.
37-
- torchtree-tensorflow_: is a plug-in interacting with Tensorflow.
3839

39-
A GitHub `template <https://github.com/4ment/torchtree-plugin-template>`_ is available to assist in the development of a plug-in, and it is highly recommended to use it. This template provides a structured starting point, ensuring consistency and compatibility with `torchtree` while streamlining the development process.
40+
.. toctree::
41+
:maxdepth: 1
42+
:caption: Advanced
4043

41-
.. _torchtree-bito: http://github.com/4ment/torchtree-bito
42-
.. _torchtree-physher: http://github.com/4ment/torchtree-physher
43-
.. _torchtree-scipy: http://github.com/4ment/torchtree-scipy
44-
.. _torchtree-tensorflow: http://github.com/4ment/torchtree-tensorflow
45-
.. _physher: http://github.com/4ment/physher
46-
.. _BEAGLE: https://github.com/beagle-dev/beagle-lib
47-
.. _bito: https://github.com/phylovi/bito
44+
advanced/concepts.rst
45+
advanced/tree_likelihood.rst
46+
advanced/benchmark.rst
4847

4948

5049
How to cite
@@ -68,24 +67,6 @@ How to cite
6867
url={https://arxiv.org/abs/2406.18044}
6968
}
7069

71-
.. toctree::
72-
:maxdepth: 1
73-
:caption: Getting Started
74-
75-
getting_started/install.rst
76-
getting_started/quick_start.rst
77-
getting_started/features.rst
78-
getting_started/json_reference.rst
79-
80-
81-
.. toctree::
82-
:maxdepth: 1
83-
:caption: Advanced
84-
85-
advanced/concepts.rst
86-
advanced/tree_likelihood.rst
87-
advanced/benchmark.rst
88-
8970
.. toctree::
9071
:hidden:
9172
:caption: API

torchtree/cli/evolution.py

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -59,13 +59,17 @@
5959

6060
def create_evolution_parser(parser):
6161
group = parser.add_mutually_exclusive_group()
62-
group.add_argument("-i", "--input", required=False, help="""alignment file""")
62+
group.add_argument(
63+
"-i", "--input", required=False, help="""alignment file in FASTA format"""
64+
)
6365
group.add_argument(
6466
"--poisson",
6567
action="store_true",
6668
help="""use poisson tree likelihood""",
6769
)
68-
parser.add_argument("-t", "--tree", required=True, help="""tree file""")
70+
parser.add_argument(
71+
"-t", "--tree", required=True, help="""tree file in newick or NEXUS format"""
72+
)
6973
parser.add_argument(
7074
"-m",
7175
"--model",

torchtree/evolution/substitution_model/__init__.py

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,63 @@
1+
r"""Markov substitution process.
2+
3+
The Markov substitution process is a stochastic model used to describe the evolution of sequences (such as nucleotides in DNA or amino acids in proteins) along the branches of a phylogenetic tree.
4+
This process is based on the principle of Markov chains, where the state of a system at a given time depends only on its immediate previous state.
5+
In the context of phylogenetics, the states typically correspond to the nucleotide or character states at each site in the sequence.
6+
7+
In this model, the substitution process is described by a **rate matrix** (also called the **Q matrix**), which specifies the instantaneous rates at which one state can change to another.
8+
9+
1. **Markov Process Basics**
10+
----------------------------
11+
12+
The substitution process is modeled as a continuous-time Markov process. The probability of observing a sequence of substitutions between nucleotides (or other characters) over time is governed by a set of transition rates, which are typically represented in a rate matrix :math:`Q`.
13+
14+
The rate matrix :math:`Q` is defined as a square matrix where the off-diagonal elements represent the rate of substitution between different states (nucleotides or characters), and the diagonal elements are negative, ensuring the rows sum to zero.
15+
16+
For a simple example of nucleotide substitutions with 4 states \{A,C,G,T\}, the rate matrix :math:`Q` might look like:
17+
18+
.. math::
19+
20+
Q = \begin{bmatrix}
21+
-3\mu & \mu & \mu & \mu \\
22+
\mu & -3\mu & \mu & \mu \\
23+
\mu & \mu & -3\mu & \mu \\
24+
\mu & \mu & \mu & -3\mu
25+
\end{bmatrix}
26+
27+
where :math:`\mu` represents the rate of substitution between different nucleotide pairs.
28+
29+
2. **Probability Matrix and Time Evolution**
30+
--------------------------------------------
31+
32+
The probability of transitioning between states over a given time period :math:`t` is given by the **probability matrix** :math:`P(t)`, which is related to the rate matrix :math:`Q` by the matrix exponential:
33+
34+
.. math::
35+
36+
P(t) = e^{Qt}
37+
38+
This matrix describes the probability of moving from one nucleotide (or character) to another after time :math:`t`, based on the rates defined in :math:`Q`.
39+
40+
The probability matrix is essential for calculating the likelihood of a given sequence alignment under a particular evolutionary model.
41+
The transition probabilities can be computed over each branch of a phylogenetic tree, and the tree can be estimated by maximizing the likelihood of the observed data under the chosen model.
42+
43+
3. **Stationary Distribution**
44+
------------------------------
45+
46+
In some cases, the Markov process reaches a stationary distribution, where the probabilities of being in each state become constant over time.
47+
For nucleotide substitution models, the stationary distribution often represents the equilibrium base frequencies of the nucleotides.
48+
49+
For example, if the stationary distribution for the nucleotide 'A' is :math:`\pi_A`, then the probability of observing 'A' at equilibrium will be :math:`\pi_A`, and similarly for the other nucleotides.
50+
51+
4. **Markov Substitution in Phylogenetics**
52+
-------------------------------------------
53+
54+
In phylogenetics, the Markov substitution process is used to model how nucleotide or character states evolve along the branches of a phylogenetic tree.
55+
The phylogenetic tree represents the relationships between species, and branch lengths represent the amount of evolutionary time or the number of substitutions.
56+
57+
Given a sequence alignment and a model of nucleotide substitution (such as the Jukes-Cantor or GTR models), the goal is to estimate the evolutionary history of the species (the tree) and the rates of substitution along the branches.
58+
The Markov substitution model allows the computation of the likelihood of observing the given sequences under the model, and the tree that maximizes this likelihood is considered the most probable evolutionary tree.
59+
60+
"""
161
from torchtree.evolution.substitution_model.amino_acid import LG, WAG
262
from torchtree.evolution.substitution_model.codon import MG94
363
from torchtree.evolution.substitution_model.general import (

torchtree/evolution/substitution_model/general.py

Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,63 @@ def from_json(cls, data, dic):
8686

8787
@register_class
8888
class GeneralSymmetricSubstitutionModel(SymmetricSubstitutionModel):
89+
r"""General symmetric substitution model.
90+
91+
The state space :math:`\Omega=\{S_0, S_1, \dots, S_{M-1}\}` of this model is defined by the `DataType` object.
92+
93+
This model is composed of:
94+
95+
- :math:`K` substitution rate parameters: :math:`\mathbf{r}=r_0, r_1, \dots, r_{K-1}` where :math:`K \leq (M^2-M)/2`.
96+
- :math:`M` equilibrium frequency parameters: :math:`\pi_0, \pi_1, \dots, \pi_{M-1}`.
97+
- A mapping function that associates each matrix element :math:`Q_{ij}` to an index in the set of rates :math:`f: \{0, 1, \dots, (M^2-M)/2-1\} \rightarrow \{0,1, \dots, K-1\}`
98+
99+
The matrix :math:`Q` is thus defined as:
100+
101+
.. math::
102+
103+
Q_{ij} =
104+
\begin{cases}
105+
r_{f(i \cdot M + j)} \pi_j & \text{if } i \neq j \\
106+
-\sum_{k \neq i} Q_{ik} & \text{if } i = j
107+
\end{cases}
108+
109+
where :math:`i,j \in \{0,1, \dots, M-1\}` are zero-based indices for rows and columns.
110+
111+
:math:`f` is implemented as a one-dimentional array :math:`\mathbf{g}[x]=f(x)` for :math:`x \in \{0, 1,\dots, (M^2-M)/2-1\}` where each element maps a position in :math:`Q` to an index in the rate array :math:`r`.
112+
The mapping is defined such as the position :math:`(i,j)` in :math:`Q` corresponds to :math:`i \cdot M+ j` for :math:`i \neq j`.
113+
The indices correspond to first iterating over rows (row 0, then row 1, etc.) and then over columns for each row of the upper off-diagonal elements.
114+
115+
The HKY substitution model can be defined as a symmetric substitution model with M=4 frequency parameters and rate parameters :math:`\mathbf{r}=r_0, r_1`.
116+
The mapping function is therefore:
117+
118+
.. math::
119+
120+
f(k) =
121+
\begin{cases}
122+
0 & \text{if } k = i \cdot 4 + j \text{ and } i \rightarrow j \text{ is transversion}\\
123+
1 & \text{otherwise}
124+
\end{cases}
125+
126+
As a one-dimentional array, the mapping is defined as :math:`\mathbf{g}=[0,1,0,0,1,0]`.
127+
128+
The HKY rate matrix :math:`Q` is given as:
129+
130+
.. math::
131+
132+
Q_{HKY} =
133+
\begin{bmatrix}
134+
-(r_0 \pi_C + r_1 \pi_G + r_0 \pi_T) & r_0 \pi_C & r_1 \pi_G & r_0 \pi_T \\
135+
r_0 \pi_A & -(r_0 \pi_A + r_0 \pi_G + r_0 \pi_T) & r_0 \pi_G & r_1 \pi_T \\
136+
r_1 \pi_A & r_0 \pi_C & -(r_1\pi_A + r_0 \pi_C + r_0 \pi_T) & r_0 \pi_T \\
137+
r_0 \pi_A & r_1 \pi_C & r_0 \pi_G & -(r_0 \pi_A + r_1 \pi_C + r_0 \pi_G)
138+
\end{bmatrix}
139+
140+
Similarly the GTR model can be specified with :math:`\mathbf{g}=[0,1,2,3,4,5]` and :math:`\mathbf{r}=r_0, r_1, r_2, r_3, r_4, r_5`.
141+
142+
.. note::
143+
The order of the equilibrium frequencies in a :class:`~torchtree.Parameter` is expected to be the order of the states defined in the DataType object.
144+
"""
145+
89146
def __init__(
90147
self,
91148
id_: ID,
@@ -144,6 +201,36 @@ def from_json(cls, data, dic):
144201

145202
@register_class
146203
class GeneralNonSymmetricSubstitutionModel(NonSymmetricSubstitutionModel):
204+
r"""General non-symmetric substitution model.
205+
206+
The state space :math:`\Omega=\{S_0, S_1, \dots, S_{M-1}\}` of this model is defined by the `DataType` object.
207+
208+
This model is composed of:
209+
210+
- :math:`K` substitution rate parameters: :math:`\mathbf{r}=r_0, r_1, \dots, r_{K-1}` where :math:`K \leq (M^2-M)`.
211+
- :math:`M` equilibrium frequency parameters: :math:`\pi_0, \pi_1, \dots, \pi_{M-1}`.
212+
- A mapping function that associates each matrix element :math:`Q_{ij}` to an index in the set of rates :math:`f: \{0, 1, \dots, (M^2-M)-1\} \rightarrow \{0,1, \dots, K-1\}`
213+
214+
The matrix :math:`Q` is thus defined as:
215+
216+
.. math::
217+
218+
Q_{ij} =
219+
\begin{cases}
220+
r_{f(i \cdot M + j)} \pi_j & \text{if } i \neq j \\
221+
-\sum_{k \neq i} Q_{ik} & \text{if } i = j
222+
\end{cases}
223+
224+
where :math:`i,j \in \{0,1, \dots, M-1\}` are zero-based indices for rows and columns.
225+
226+
:math:`f` is implemented as a one-dimentional array :math:`\mathbf{g}[x]=f(x)` for :math:`x \in \{0, 1,\dots, (M^2-M)-1\}` where each element maps a position in :math:`Q` to an index in the rate array :math:`r`.
227+
The mapping is defined such as the position :math:`(i,j)` in :math:`Q` corresponds to :math:`i \cdot M + j` for :math:`i > j` and :math:`j \cdot M + i + (M^2-M)/2` for :math:`i < j`.
228+
In other words, the first of :math:`\mathbf{g}` corresponds to the upper off-diagonal elements and the second to the lower off-diagonal elements.
229+
230+
.. note::
231+
The order of the equilibrium frequencies in a :class:`~torchtree.Parameter` is expected to be the order of the states defined in the DataType object.
232+
"""
233+
147234
def __init__(
148235
self,
149236
id_: ID,

0 commit comments

Comments
 (0)