Add documentation for substitution models

4ment · 4ment · commit 964374787383 · 2024-11-26T17:01:54.000+11:00
diff --git a/README.md b/README.md
@@ -56,6 +56,24 @@ For detailed information on how to use `torchtree` and its features, please refe
 ## Quick start
 `torchtree` requires a JSON file containing models and algorithms. A configuration file can be generated using `torchtree-cli`, a command line-based tool. This two-step process allows the user to adjust values in the configuration file, such as hyperparameters.
 
+`torchtree-cli` implements several subcommands, each corresponding to a different type of inference algorithm.
+A list of available subcommands can be obtained by running `torchtree-cli --help`.
+
+The following subcommands are available:
+
+* `advi`: Automatic differentiation variational inference
+* `hmc`: Hamiltonian Monte Carlo
+* `map`: Maximum *a posteriori*
+* `mcmc`: Markov chain Monte Carlo
+
+Each subcommand/algorithm requires a different set of arguments which can be obtained by running `torchtree-cli <subcommand> --help`.
+
+`torchtree-cli` requires an alignment file in FASTA format and a tree file in either [Newick](https://en.wikipedia.org/wiki/Newick_format) or [NEXUS](https://en.wikipedia.org/wiki/Nexus_file) format.
+While *torchtree* uses the [DendroPy](https://jeetsukumaran.github.io/DendroPy) library to parse and manipulate phylogenetic trees, it is recommended to use a Newick file due to the numerous variations of the NEXUS format.
+
+Let's explore a few examples of how to use these programs using an influenza A virus dataset containing 69 DNA sequences.
+The alignment and tree files are located in the [data](data) directory.
+
 ### 1 - Generating a configuration file
 Some examples of models using variational inference:
 
diff --git a/docs/getting_started/quick_start.rst b/docs/getting_started/quick_start.rst
@@ -17,6 +17,8 @@ The following subcommands are available:
 
 Each subcommand/algorithm requires a different set of arguments which can be obtained by running ``torchtree-cli <subcommand> --help``.
 
+:command:`torchtree-cli` requires an alignment file in FASTA format and a tree file in either `Newick <https://en.wikipedia.org/wiki/Newick_format>`_ or `NEXUS <https://en.wikipedia.org/wiki/Nexus_file>`_ format.
+While *torchtree* uses the `DendroPy <https://jeetsukumaran.github.io/DendroPy/>`_ library to parse and manipulate phylogenetic trees, it is recommended to use a Newick file due to the numerous variations of the NEXUS format.
 
 Let's explore a few examples of how to use these programs using an influenza A virus dataset containing 69 DNA sequences.
 The alignment and tree files are located in the `data <https://github.com/4ment/torchtree/tree/master/data>`_ directory of the `torchtree <https://github.com/4ment/torchtree>`_ repository.
diff --git a/docs/index.rst b/docs/index.rst
@@ -9,7 +9,7 @@ torchtree is a program designed for developing and inferring phylogenetic models
 It is implemented in Python and uses PyTorch to leverage automatic differentiation.
 Inference algorithms include variational inference, Hamiltonian Monte Carlo, maximum *a posteriori* and Markov chain Monte Carlo.
 
-For a comprehensive assessment of torchtree's performance and use cases, please see our evaluation repository, [torchtree-experiments](https://github.com/4ment/torchtree-experiments), where torchtree was rigorously tested on various datasets and benchmarked for accuracy and speed.
+For a comprehensive assessment of torchtree's performance and use cases, please see our evaluation repository, `torchtree-experiments <https://github.com/4ment/torchtree-experiments>`_, where torchtree was rigorously tested on various datasets and benchmarked for accuracy and speed.
 
 Installation
 ------------
@@ -26,25 +26,24 @@ Installation
       pip install torchtree
 
 
-Plug-ins
-------------------
+.. toctree::
+   :maxdepth: 1
+   :caption: Getting Started
 
-torchtree can be easily extended without modifying the code base thanks its modular implementation. Some examples of external packages:
+   getting_started/install.rst
+   getting_started/quick_start.rst
+   getting_started/features.rst
+   getting_started/json_reference.rst
+   getting_started/plugins.rst
 
-- torchtree-bito_: is a plug-in interfacing the bito_ library for fast gradient calculations with BEAGLE_.
-- torchtree-physher_: is a plug-in interfacing physher_ for fast gradient calculations of tree likelihood and coalescent models.
-- torchtree-scipy_: is a plug-in interfacing the SciPy package.
-- torchtree-tensorflow_: is a plug-in interacting with Tensorflow.
 
-A GitHub `template <https://github.com/4ment/torchtree-plugin-template>`_ is available to assist in the development of a plug-in, and it is highly recommended to use it. This template provides a structured starting point, ensuring consistency and compatibility with `torchtree` while streamlining the development process.
+.. toctree::
+   :maxdepth: 1
+   :caption: Advanced
 
-.. _torchtree-bito: http://github.com/4ment/torchtree-bito
-.. _torchtree-physher: http://github.com/4ment/torchtree-physher
-.. _torchtree-scipy: http://github.com/4ment/torchtree-scipy
-.. _torchtree-tensorflow: http://github.com/4ment/torchtree-tensorflow
-.. _physher: http://github.com/4ment/physher
-.. _BEAGLE: https://github.com/beagle-dev/beagle-lib
-.. _bito: https://github.com/phylovi/bito
+   advanced/concepts.rst
+   advanced/tree_likelihood.rst
+   advanced/benchmark.rst
 
 
 How to cite
@@ -68,24 +67,6 @@ How to cite
             url={https://arxiv.org/abs/2406.18044}
       }
 
-.. toctree::
-   :maxdepth: 1
-   :caption: Getting Started
-
-   getting_started/install.rst
-   getting_started/quick_start.rst
-   getting_started/features.rst
-   getting_started/json_reference.rst
-
-
-.. toctree::
-   :maxdepth: 1
-   :caption: Advanced
-
-   advanced/concepts.rst
-   advanced/tree_likelihood.rst
-   advanced/benchmark.rst
-
 .. toctree::
    :hidden:
    :caption: API
diff --git a/torchtree/cli/evolution.py b/torchtree/cli/evolution.py
@@ -59,13 +59,17 @@
 
 def create_evolution_parser(parser):
     group = parser.add_mutually_exclusive_group()
-    group.add_argument("-i", "--input", required=False, help="""alignment file""")
+    group.add_argument(
+        "-i", "--input", required=False, help="""alignment file in FASTA format"""
+    )
     group.add_argument(
         "--poisson",
         action="store_true",
         help="""use poisson tree likelihood""",
     )
-    parser.add_argument("-t", "--tree", required=True, help="""tree file""")
+    parser.add_argument(
+        "-t", "--tree", required=True, help="""tree file in Newick or NEXUS format"""
+    )
     parser.add_argument(
         "-m",
         "--model",
diff --git a/torchtree/evolution/substitution_model/__init__.py b/torchtree/evolution/substitution_model/__init__.py
@@ -1,3 +1,63 @@
+r"""Markov substitution process.
+
+The Markov substitution process is a stochastic model used to describe the evolution of sequences (such as nucleotides in DNA or amino acids in proteins) along the branches of a phylogenetic tree.
+This process is based on the principle of Markov chains, where the future state of a system depends only on its current state and not on the states that preceded it.
+In the context of phylogenetics, the states typically correspond to the nucleotide or character states at each site in the sequence.
+
+In this model, the substitution process is described by a **rate matrix** (also called the **Q matrix**), which specifies the instantaneous rates at which one state can change to another.
+
+1. **Markov Process Basics**
+----------------------------
+
+The substitution process is modeled as a continuous-time Markov process. The probability of observing a sequence of substitutions between nucleotides (or other characters) over time is governed by a set of transition rates, which are typically represented in a rate matrix :math:`Q`.
+
+The rate matrix :math:`Q` is defined as a square matrix where the off-diagonal elements represent the rates of substitution between different states (nucleotides or characters), and each diagonal element is the negative sum of the off-diagonal elements in its row, ensuring that every row sums to zero.
+
+For a simple example of nucleotide substitutions with 4 states \{A,C,G,T\}, the rate matrix :math:`Q` might look like:
+
+.. math::
+
+    Q = \begin{bmatrix}
+    -3\mu & \mu & \mu & \mu \\
+    \mu & -3\mu & \mu & \mu \\
+    \mu & \mu & -3\mu & \mu \\
+    \mu & \mu & \mu & -3\mu
+    \end{bmatrix}
+
+where :math:`\mu` is the rate of substitution, shared by every pair of distinct nucleotides.
+
+2. **Probability Matrix and Time Evolution**
+--------------------------------------------
+
+The probability of transitioning between states over a given time period :math:`t` is given by the **probability matrix** :math:`P(t)`, which is related to the rate matrix :math:`Q` by the matrix exponential:
+
+.. math::
+
+    P(t) = e^{Qt}
+
+This matrix describes the probability of moving from one nucleotide (or character) to another after time :math:`t`, based on the rates defined in :math:`Q`.
+
+The probability matrix is essential for calculating the likelihood of a given sequence alignment under a particular evolutionary model.
+The transition probabilities can be computed over each branch of a phylogenetic tree, and the tree can be estimated by maximizing the likelihood of the observed data under the chosen model.
+
+3. **Stationary Distribution**
+------------------------------
+
+In some cases, the Markov process reaches a stationary distribution, where the probabilities of being in each state become constant over time.
+For nucleotide substitution models, the stationary distribution often represents the equilibrium base frequencies of the nucleotides.
+
+For example, if the stationary distribution for the nucleotide 'A' is :math:`\pi_A`, then the probability of observing 'A' at equilibrium will be :math:`\pi_A`, and similarly for the other nucleotides.
+
+4. **Markov Substitution in Phylogenetics**
+-------------------------------------------
+
+In phylogenetics, the Markov substitution process is used to model how nucleotide or character states evolve along the branches of a phylogenetic tree.
+The phylogenetic tree represents the relationships between species, and branch lengths represent the amount of evolutionary time or the number of substitutions.
+
+Given a sequence alignment and a model of nucleotide substitution (such as the Jukes-Cantor or GTR models), the goal is to estimate the evolutionary history of the species (the tree) and the rates of substitution along the branches.
+The Markov substitution model allows the computation of the likelihood of observing the given sequences under the model, and the tree that maximizes this likelihood is the maximum likelihood estimate of the evolutionary tree.
+
+"""
 from torchtree.evolution.substitution_model.amino_acid import LG, WAG
 from torchtree.evolution.substitution_model.codon import MG94
 from torchtree.evolution.substitution_model.general import (
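The relationship :math:`P(t) = e^{Qt}` described in the docstring above can be checked numerically for the Jukes-Cantor matrix. The following is a minimal sketch in pure Python, not torchtree code: the helper names (`jc69_rate_matrix`, `matmul`, `matrix_exp`) and the values of `mu` and `t` are hypothetical and chosen only for illustration. It approximates the matrix exponential with scaling and squaring and compares the result with the well-known Jukes-Cantor closed form.

```python
import math


def jc69_rate_matrix(mu):
    """Jukes-Cantor rate matrix: mu off the diagonal, -3*mu on the diagonal."""
    return [[-3.0 * mu if i == j else mu for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Product of two square matrices stored as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def matrix_exp(q, t, squarings=10, terms=20):
    """Approximate exp(q * t) with a truncated Taylor series and repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    # Start from the identity matrix and accumulate a^k / k!.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [
            [result[i][j] + term[i][j] for j in range(n)] for i in range(n)
        ]
    # Undo the scaling: exp(A) = (exp(A / 2^s))^(2^s).
    for _ in range(squarings):
        result = matmul(result, result)
    return result


mu, t = 0.1, 2.0  # illustrative values only
Q = jc69_rate_matrix(mu)
P = matrix_exp(Q, t)

# Closed-form Jukes-Cantor transition probabilities for comparison.
decay = math.exp(-4.0 * mu * t)
p_same = 0.25 + 0.75 * decay  # probability of ending in the starting state
p_diff = 0.25 - 0.25 * decay  # probability of ending in a given other state

# Each row of P(t) is a probability distribution over the end states.
row_sums = [sum(row) for row in P]
```

Each row of `P` sums to one, and as `t` grows every entry approaches 0.25, the stationary distribution of the Jukes-Cantor model.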
diff --git a/torchtree/evolution/substitution_model/general.py b/torchtree/evolution/substitution_model/general.py
@@ -86,6 +86,63 @@ def from_json(cls, data, dic):
 
 @register_class
 class GeneralSymmetricSubstitutionModel(SymmetricSubstitutionModel):
+    r"""General symmetric substitution model.
+
+    The state space :math:`\Omega=\{S_0, S_1, \dots, S_{M-1}\}` of this model is defined by the `DataType` object.
+    
+    This model is composed of:
+
+    - :math:`K` substitution rate parameters: :math:`\mathbf{r}=r_0, r_1, \dots, r_{K-1}` where :math:`K \leq (M^2-M)/2`.
+    - :math:`M` equilibrium frequency parameters: :math:`\pi_0, \pi_1, \dots, \pi_{M-1}`.
+    - A mapping function that associates each matrix element :math:`Q_{ij}` to an index in the set of rates :math:`f: \{0, 1, \dots, (M^2-M)/2-1\} \rightarrow \{0,1, \dots, K-1\}`
+    
+    The matrix :math:`Q` is thus defined as:
+
+    .. math::
+
+        Q_{ij} =
+        \begin{cases}
+        r_{f(i \cdot M + j)} \pi_j & \text{if } i \neq j \\
+        -\sum_{k \neq i} Q_{ik} & \text{if } i = j
+        \end{cases}
+    
+    where :math:`i,j \in \{0,1, \dots, M-1\}` are zero-based indices for rows and columns.
+
+    :math:`f` is implemented as a one-dimensional array :math:`\mathbf{g}[x]=f(x)` for :math:`x \in \{0, 1,\dots, (M^2-M)/2-1\}` where each element maps a position in :math:`Q` to an index in the rate array :math:`\mathbf{r}`.
+    The mapping is defined such that the position :math:`(i,j)` in :math:`Q` corresponds to :math:`i \cdot M+ j` for :math:`i \neq j`.
+    The elements of :math:`\mathbf{g}` are ordered by iterating over the upper off-diagonal elements of :math:`Q` row by row (row 0, then row 1, etc.) and, within each row, over columns.
+
+    The HKY substitution model can be defined as a symmetric substitution model with :math:`M=4` frequency parameters and rate parameters :math:`\mathbf{r}=r_0, r_1`.
+    The mapping function is therefore:
+    
+    .. math::
+
+        f(k) =
+        \begin{cases}
+        0 & \text{if } k = i \cdot 4 + j \text{ and } i \rightarrow j \text{ is a transversion}\\
+        1 & \text{otherwise}
+        \end{cases}
+    
+    As a one-dimensional array, the mapping is defined as :math:`\mathbf{g}=[0,1,0,0,1,0]`, whose elements correspond to the pairs AC, AG, AT, CG, CT and GT.
+    
+    The HKY rate matrix :math:`Q` is given as:
+
+    .. math::
+
+        Q_{HKY} = 
+        \begin{bmatrix}
+        -(r_0 \pi_C + r_1 \pi_G + r_0 \pi_T) & r_0 \pi_C & r_1 \pi_G & r_0 \pi_T \\
+        r_0 \pi_A & -(r_0 \pi_A + r_0 \pi_G + r_1 \pi_T) & r_0 \pi_G & r_1 \pi_T \\
+        r_1 \pi_A & r_0 \pi_C & -(r_1\pi_A + r_0 \pi_C + r_0 \pi_T) & r_0 \pi_T \\
+        r_0 \pi_A & r_1 \pi_C & r_0 \pi_G & -(r_0 \pi_A + r_1 \pi_C + r_0 \pi_G)
+        \end{bmatrix}
+    
+    Similarly, the GTR model can be specified with :math:`\mathbf{g}=[0,1,2,3,4,5]` and :math:`\mathbf{r}=r_0, r_1, r_2, r_3, r_4, r_5`.
+    
+    .. note::
+        The order of the equilibrium frequencies in a :class:`~torchtree.Parameter` is expected to be the order of the states defined in the DataType object.
+    """
+
     def __init__(
         self,
         id_: ID,
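The construction of :math:`Q` described in the `GeneralSymmetricSubstitutionModel` docstring can be illustrated with a small standalone sketch. It is not the torchtree implementation: the function `symmetric_rate_matrix` and the numeric rates and frequencies are hypothetical, and the sketch assumes that the mapping array enumerates the upper off-diagonal elements row by row, as the docstring states, with each pair sharing one rate in both directions.

```python
def symmetric_rate_matrix(rates, freqs, mapping):
    """Build Q from rates r, frequencies pi and the mapping array g.

    Assumption: mapping[x] gives the rate index of the x-th upper
    off-diagonal element of Q, enumerated row by row.
    """
    m = len(freqs)
    if len(mapping) != m * (m - 1) // 2:
        raise ValueError("mapping must have (M^2 - M) / 2 elements")
    q = [[0.0] * m for _ in range(m)]
    x = 0  # position in the mapping array
    for i in range(m):
        for j in range(i + 1, m):
            r = rates[mapping[x]]
            q[i][j] = r * freqs[j]  # Q_ij = r * pi_j
            q[j][i] = r * freqs[i]  # same rate for the reverse direction
            x += 1
    # Diagonal: negative sum of the off-diagonal elements of each row.
    for i in range(m):
        q[i][i] = -sum(q[i][j] for j in range(m) if j != i)
    return q


# HKY with states ordered A, C, G, T.
freqs = [0.1, 0.2, 0.3, 0.4]      # illustrative equilibrium frequencies
rates = [1.0, 3.0]                # r_0: transversions, r_1: transitions
hky_mapping = [0, 1, 0, 0, 1, 0]  # pairs AC, AG, AT, CG, CT, GT
Q = symmetric_rate_matrix(rates, freqs, hky_mapping)
```

With this ordering, the A to G and C to T entries use the transition rate `r_1`, all other off-diagonal entries use `r_0`, and the resulting matrix satisfies detailed balance, :math:`\pi_i Q_{ij} = \pi_j Q_{ji}`.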
@@ -144,6 +201,36 @@ def from_json(cls, data, dic):
 
 @register_class
 class GeneralNonSymmetricSubstitutionModel(NonSymmetricSubstitutionModel):
+    r"""General non-symmetric substitution model.
+
+    The state space :math:`\Omega=\{S_0, S_1, \dots, S_{M-1}\}` of this model is defined by the `DataType` object.
+    
+    This model is composed of:
+
+    - :math:`K` substitution rate parameters: :math:`\mathbf{r}=r_0, r_1, \dots, r_{K-1}` where :math:`K \leq (M^2-M)`.
+    - :math:`M` equilibrium frequency parameters: :math:`\pi_0, \pi_1, \dots, \pi_{M-1}`.
+    - A mapping function that associates each matrix element :math:`Q_{ij}` to an index in the set of rates :math:`f: \{0, 1, \dots, (M^2-M)-1\} \rightarrow \{0,1, \dots, K-1\}`
+    
+    The matrix :math:`Q` is thus defined as:
+
+    .. math::
+
+        Q_{ij} =
+        \begin{cases}
+        r_{f(i \cdot M + j)} \pi_j & \text{if } i \neq j \\
+        -\sum_{k \neq i} Q_{ik} & \text{if } i = j
+        \end{cases}
+    
+    where :math:`i,j \in \{0,1, \dots, M-1\}` are zero-based indices for rows and columns.
+
+    :math:`f` is implemented as a one-dimensional array :math:`\mathbf{g}[x]=f(x)` for :math:`x \in \{0, 1,\dots, (M^2-M)-1\}` where each element maps a position in :math:`Q` to an index in the rate array :math:`\mathbf{r}`.
+    The mapping is defined such that the position :math:`(i,j)` in :math:`Q` corresponds to :math:`i \cdot M + j` for :math:`i < j` and :math:`j \cdot M + i + (M^2-M)/2` for :math:`i > j`.
+    In other words, the first half of :math:`\mathbf{g}` corresponds to the upper off-diagonal elements and the second half to the lower off-diagonal elements.
+    
+    .. note::
+        The order of the equilibrium frequencies in a :class:`~torchtree.Parameter` is expected to be the order of the states defined in the DataType object.
+    """
+
     def __init__(
         self,
         id_: ID,
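The non-symmetric mapping can be sketched in the same way. This is an illustration under stated assumptions rather than the torchtree implementation: the function `nonsymmetric_rate_matrix` and all numeric values are hypothetical, the first half of the mapping array is assumed to cover the upper off-diagonal elements row by row, and the second half is assumed to cover the mirrored lower off-diagonal elements in the same order.

```python
def nonsymmetric_rate_matrix(rates, freqs, mapping):
    """Build a non-symmetric Q from rates r, frequencies pi and mapping g.

    Assumption: mapping[:half] indexes the upper off-diagonal elements
    row by row, and mapping[half:] indexes the mirrored lower elements.
    """
    m = len(freqs)
    half = m * (m - 1) // 2
    if len(mapping) != 2 * half:
        raise ValueError("mapping must have M^2 - M elements")
    q = [[0.0] * m for _ in range(m)]
    x = 0  # position within the upper-triangle enumeration
    for i in range(m):
        for j in range(i + 1, m):
            q[i][j] = rates[mapping[x]] * freqs[j]         # upper element
            q[j][i] = rates[mapping[x + half]] * freqs[i]  # mirrored lower element
            x += 1
    # Diagonal: negative sum of the off-diagonal elements of each row.
    for i in range(m):
        q[i][i] = -sum(q[i][j] for j in range(m) if j != i)
    return q


# Three-state example in which every off-diagonal element has its own rate.
freqs3 = [0.2, 0.3, 0.5]
rates6 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Q3 = nonsymmetric_rate_matrix(rates6, freqs3, list(range(6)))
```

Reusing an index in the mapping ties several matrix elements to the same rate, so any model with :math:`K \leq M^2-M` rates fits this pattern.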
diff --git a/torchtree/evolution/substitution_model/nucleotide.py b/torchtree/evolution/substitution_model/nucleotide.py