You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+18Lines changed: 18 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -56,6 +56,24 @@ For detailed information on how to use `torchtree` and its features, please refe
56
56
## Quick start
57
57
`torchtree` requires a JSON file containing models and algorithms. A configuration file can be generated using `torchtree-cli`, a command line-based tool. This two-step process allows the user to adjust values in the configuration file, such as hyperparameters.
58
58
59
+
`torchtree-cli` implements several subcommands, each corresponding to a different type of inference algorithm.
60
+
A list of available subcommands can be obtained by running `torchtree-cli --help`.
Each subcommand/algorithm requires a different set of arguments which can be obtained by running `torchtree-cli <subcommand> --help`.
70
+
71
+
`torchtree-cli` requires an alignment file in FASTA format and a tree file in either [Newick](https://en.wikipedia.org/wiki/Newick_format) or [NEXUS](https://en.wikipedia.org/wiki/Nexus_file) format.
72
+
While *torchtree* uses the [DendroPy](https://jeetsukumaran.github.io/DendroPy) library to parse and manipulate phylogenetic trees, it is recommended to use a Newick file due to the numerous variations of the NEXUS format.
73
+
74
+
Let's explore a few examples of how to use these programs using an influenza A virus dataset containing 69 DNA sequences.
75
+
The alignment and tree files are located in the [data](data) directory.
76
+
59
77
### 1 - Generating a configuration file
60
78
Some examples of models using variational inference:
Copy file name to clipboardExpand all lines: docs/getting_started/quick_start.rst
+2Lines changed: 2 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -17,6 +17,8 @@ The following subcommands are available:
17
17
18
18
Each subcommand/algorithm requires a different set of arguments which can be obtained by running ``torchtree-cli <subcommand> --help``.
19
19
20
+
:command:`torchtree-cli` requires an alignment file in FASTA format and a tree file in either `Newick <https://en.wikipedia.org/wiki/Newick_format>`_ or `NEXUS <https://en.wikipedia.org/wiki/Nexus_file>`_ format.
21
+
While *torchtree* uses the `DendroPy <https://jeetsukumaran.github.io/DendroPy/>`_ library to parse and manipulate phylogenetic trees, it is recommended to use a Newick file due to the numerous variations of the NEXUS format.
20
22
21
23
Let's explore a few examples of how to use these programs using an influenza A virus dataset containing 69 DNA sequences.
22
24
The alignment and tree files are located in the `data <https://github.com/4ment/torchtree/tree/master/data>`_ directory of the `torchtree <https://github.com/4ment/torchtree>`_ repository.
Copy file name to clipboardExpand all lines: docs/index.rst
+15-34Lines changed: 15 additions & 34 deletions
Original file line number
Diff line number
Diff line change
@@ -9,7 +9,7 @@ torchtree is a program designed for developing and inferring phylogenetic models
9
9
It is implemented in Python and uses PyTorch to leverage automatic differentiation.
10
10
Inference algorithms include variational inference, Hamiltonian Monte Carlo, maximum *a posteriori* and Markov chain Monte Carlo.
11
11
12
-
For a comprehensive assessment of torchtree's performance and use cases, please see our evaluation repository, [torchtree-experiments](https://github.com/4ment/torchtree-experiments), where torchtree was rigorously tested on various datasets and benchmarked for accuracy and speed.
12
+
For a comprehensive assessment of torchtree's performance and use cases, please see our evaluation repository, `torchtree-experiments<https://github.com/4ment/torchtree-experiments>`_, where torchtree was rigorously tested on various datasets and benchmarked for accuracy and speed.
13
13
14
14
Installation
15
15
------------
@@ -26,25 +26,24 @@ Installation
26
26
pip install torchtree
27
27
28
28
29
-
Plug-ins
30
-
------------------
29
+
.. toctree::
30
+
:maxdepth:1
31
+
:caption:Getting Started
31
32
32
-
torchtree can be easily extended without modifying the code base thanks its modular implementation. Some examples of external packages:
33
+
getting_started/install.rst
34
+
getting_started/quick_start.rst
35
+
getting_started/features.rst
36
+
getting_started/json_reference.rst
37
+
getting_started/plugins.rst
33
38
34
-
- torchtree-bito_: is a plug-in interfacing the bito_ library for fast gradient calculations with BEAGLE_.
35
-
- torchtree-physher_: is a plug-in interfacing physher_ for fast gradient calculations of tree likelihood and coalescent models.
36
-
- torchtree-scipy_: is a plug-in interfacing the SciPy package.
37
-
- torchtree-tensorflow_: is a plug-in interacting with Tensorflow.
38
39
39
-
A GitHub `template <https://github.com/4ment/torchtree-plugin-template>`_ is available to assist in the development of a plug-in, and it is highly recommended to use it. This template provides a structured starting point, ensuring consistency and compatibility with `torchtree` while streamlining the development process.
Copy file name to clipboardExpand all lines: torchtree/evolution/substitution_model/__init__.py
+60Lines changed: 60 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -1,3 +1,63 @@
1
+
r"""Markov substitution process.
2
+
3
+
The Markov substitution process is a stochastic model used to describe the evolution of sequences (such as nucleotides in DNA or amino acids in proteins) along the branches of a phylogenetic tree.
4
+
This process is based on the principle of Markov chains, where the state of a system at a given time depends only on its immediate previous state.
5
+
In the context of phylogenetics, the states typically correspond to the nucleotide or character states at each site in the sequence.
6
+
7
+
In this model, the substitution process is described by a **rate matrix** (also called the **Q matrix**), which specifies the instantaneous rates at which one state can change to another.
8
+
9
+
1. **Markov Process Basics**
10
+
----------------------------
11
+
12
+
The substitution process is modeled as a continuous-time Markov process. The probability of observing a sequence of substitutions between nucleotides (or other characters) over time is governed by a set of transition rates, which are typically represented in a rate matrix :math:`Q`.
13
+
14
+
The rate matrix :math:`Q` is defined as a square matrix where the off-diagonal elements represent the rate of substitution between different states (nucleotides or characters), and the diagonal elements are negative, ensuring the rows sum to zero.
15
+
16
+
For a simple example of nucleotide substitutions with 4 states \{A,C,G,T\}, the rate matrix :math:`Q` might look like:
17
+
18
+
.. math::
19
+
20
+
Q = \begin{bmatrix}
21
+
-3\mu & \mu & \mu & \mu \\
22
+
\mu & -3\mu & \mu & \mu \\
23
+
\mu & \mu & -3\mu & \mu \\
24
+
\mu & \mu & \mu & -3\mu
25
+
\end{bmatrix}
26
+
27
+
where :math:`\mu` represents the rate of substitution between different nucleotide pairs.
28
+
29
+
2. **Probability Matrix and Time Evolution**
30
+
--------------------------------------------
31
+
32
+
The probability of transitioning between states over a given time period :math:`t` is given by the **probability matrix** :math:`P(t)`, which is related to the rate matrix :math:`Q` by the matrix exponential:
33
+
34
+
.. math::
35
+
36
+
P(t) = e^{Qt}
37
+
38
+
This matrix describes the probability of moving from one nucleotide (or character) to another after time :math:`t`, based on the rates defined in :math:`Q`.
39
+
40
+
The probability matrix is essential for calculating the likelihood of a given sequence alignment under a particular evolutionary model.
41
+
The transition probabilities can be computed over each branch of a phylogenetic tree, and the tree can be estimated by maximizing the likelihood of the observed data under the chosen model.
42
+
43
+
3. **Stationary Distribution**
44
+
------------------------------
45
+
46
+
In some cases, the Markov process reaches a stationary distribution, where the probabilities of being in each state become constant over time.
47
+
For nucleotide substitution models, the stationary distribution often represents the equilibrium base frequencies of the nucleotides.
48
+
49
+
For example, if the stationary distribution for the nucleotide 'A' is :math:`\pi_A`, then the probability of observing 'A' at equilibrium will be :math:`\pi_A`, and similarly for the other nucleotides.
50
+
51
+
4. **Markov Substitution in Phylogenetics**
52
+
-------------------------------------------
53
+
54
+
In phylogenetics, the Markov substitution process is used to model how nucleotide or character states evolve along the branches of a phylogenetic tree.
55
+
The phylogenetic tree represents the relationships between species, and branch lengths represent the amount of evolutionary time or the number of substitutions.
56
+
57
+
Given a sequence alignment and a model of nucleotide substitution (such as the Jukes-Cantor or GTR models), the goal is to estimate the evolutionary history of the species (the tree) and the rates of substitution along the branches.
58
+
The Markov substitution model allows the computation of the likelihood of observing the given sequences under the model, and the tree that maximizes this likelihood is considered the most probable evolutionary tree.
- :math:`M` equilibrium frequency parameters: :math:`\pi_0, \pi_1, \dots, \pi_{M-1}`.
97
+
- A mapping function that associates each matrix element :math:`Q_{ij}` to an index in the set of rates :math:`f: \{0, 1, \dots, (M^2-M)/2-1\} \rightarrow \{0,1, \dots, K-1\}`
98
+
99
+
The matrix :math:`Q` is thus defined as:
100
+
101
+
.. math::
102
+
103
+
Q_{ij} =
104
+
\begin{cases}
105
+
r_{f(i \cdot M + j)} \pi_j & \text{if } i \neq j \\
106
+
-\sum_{k \neq i} Q_{ik} & \text{if } i = j
107
+
\end{cases}
108
+
109
+
where :math:`i,j \in \{0,1, \dots, M-1\}` are zero-based indices for rows and columns.
110
+
111
+
:math:`f` is implemented as a one-dimentional array :math:`\mathbf{g}[x]=f(x)` for :math:`x \in \{0, 1,\dots, (M^2-M)/2-1\}` where each element maps a position in :math:`Q` to an index in the rate array :math:`r`.
112
+
The mapping is defined such as the position :math:`(i,j)` in :math:`Q` corresponds to :math:`i \cdot M+ j` for :math:`i \neq j`.
113
+
The indices correspond to first iterating over rows (row 0, then row 1, etc.) and then over columns for each row of the upper off-diagonal elements.
114
+
115
+
The HKY substitution model can be defined as a symmetric substitution model with M=4 frequency parameters and rate parameters :math:`\mathbf{r}=r_0, r_1`.
116
+
The mapping function is therefore:
117
+
118
+
.. math::
119
+
120
+
f(k) =
121
+
\begin{cases}
122
+
0 & \text{if } k = i \cdot 4 + j \text{ and } i \rightarrow j \text{ is transversion}\\
123
+
1 & \text{otherwise}
124
+
\end{cases}
125
+
126
+
As a one-dimentional array, the mapping is defined as :math:`\mathbf{g}=[0,1,0,0,1,0]`.
- :math:`M` equilibrium frequency parameters: :math:`\pi_0, \pi_1, \dots, \pi_{M-1}`.
212
+
- A mapping function that associates each matrix element :math:`Q_{ij}` to an index in the set of rates :math:`f: \{0, 1, \dots, (M^2-M)-1\} \rightarrow \{0,1, \dots, K-1\}`
213
+
214
+
The matrix :math:`Q` is thus defined as:
215
+
216
+
.. math::
217
+
218
+
Q_{ij} =
219
+
\begin{cases}
220
+
r_{f(i \cdot M + j)} \pi_j & \text{if } i \neq j \\
221
+
-\sum_{k \neq i} Q_{ik} & \text{if } i = j
222
+
\end{cases}
223
+
224
+
where :math:`i,j \in \{0,1, \dots, M-1\}` are zero-based indices for rows and columns.
225
+
226
+
:math:`f` is implemented as a one-dimentional array :math:`\mathbf{g}[x]=f(x)` for :math:`x \in \{0, 1,\dots, (M^2-M)-1\}` where each element maps a position in :math:`Q` to an index in the rate array :math:`r`.
227
+
The mapping is defined such as the position :math:`(i,j)` in :math:`Q` corresponds to :math:`i \cdot M + j` for :math:`i > j` and :math:`j \cdot M + i + (M^2-M)/2` for :math:`i < j`.
228
+
In other words, the first of :math:`\mathbf{g}` corresponds to the upper off-diagonal elements and the second to the lower off-diagonal elements.
229
+
230
+
.. note::
231
+
The order of the equilibrium frequencies in a :class:`~torchtree.Parameter` is expected to be the order of the states defined in the DataType object.
0 commit comments