From 739fc3c5a5f48e49abfd1d687a7dbee9bf2ed61e Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Tue, 2 Feb 2021 21:59:42 +0100 Subject: [PATCH 01/14] started to work on design doc --- designs/0027-adjoint-ode.md | 156 ++++++++++++++++++++++++++++++++++++ 1 file changed, 156 insertions(+) create mode 100644 designs/0027-adjoint-ode.md diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md new file mode 100644 index 0000000..4cc79f2 --- /dev/null +++ b/designs/0027-adjoint-ode.md @@ -0,0 +1,156 @@ +- Feature Name: ode-adjoint +- Start Date: 2021-02-02 +- RFC PR: (leave this empty) +- Stan Issue: (leave this empty) + +# Summary +[summary]: #summary + +Ordinary differential equations (ODEs) are currently severly costly to +fit in Stan for larger systems. This is due to the multiplicative +scaling of the required computing ressources with the ODE system size +N and the number of parameters M defining the ODE. Currently we +implement in Stan the so-called forward-mode ODE method in order to +obtain in addition to the solution of the ODE the sensitivities of the +ODE. The cost of this method scales as + +$$ N * M . $$ + +The computational cost of the adjoint method is much more favorable +in comparison which is + +$$ 2 * N + M .$$ + +The advantage of the adjoint methods shows in particular whenever the +number of parameters are relatively large in comparison to the number +of states. Most importantly, the computational cost grows only +linearly in the number of states and parameters (while forward grows +exponentially). Thus, this method can scale to much larger problems +without exponentially increasing computational cost. + +# Motivation +[motivation]: #motivation + +Stan is currently practically limited to solve problems with ODEs +which are small in the number of states and parameters. If either of +them gets large, the computational cost explodes quickly. With the +adjoint ODE solving method Stan will be able to scale to much larger +problems involving more states and more parameters. Examples are large +pharmacokinetic / pharmacodynamic systems or physiological based +pharmacokinetic models or bigger susceptible infectious & removed +models. As the adjoint ODE method will grow the computational cost +only linearly with states and parameters, the modelers can more freely +increase the complexity of the ODE model without having to worry too +much about infeasible inference times. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +The new solver `ode_adjoint` will follow in it's design the variadic +ODE functions. As the new solver is numerically more involved (see +reference level) there are merely more tuning parameters needed for +the solver. The actual impementation will be provided by the CVODES +package which we already include in Stan. + +From a more internal stan math perspective, the key difference to the +existing forward mode solvers will be that the gradients are not any +more precomputed. Instead, during the forward sweep of reverse mode +autodiff, the ODE is solved first for the N states (without any +sensitivities). When reverse mode AD then performs the backward sweep, +the adjoints of the input operands are directly computed given the +input adjoints. This involves a backward solve in time with N states +and solving M quadrature problems in addition. + +The numerical complexity is higher for the adjoint method in +comparison to the forward method. While most of the complexity is +handled by CVODES, the numerical procedure seems to require more +knowledge about the tuning parameters. At least at the moment it +appears not possible to make an easy to use interface of this +functionality available without most tuning parameters. We need to +first collect some experience before a simpler version can be made +available (if that is feasible at all). The tuning parameters exposed +as proposed are: + +- absolute & relative tolerance forward solve; absolute tolerance + should be a vector of length N, the number of states +- absolute & relative tolerance backward solve +- absolute & relative tolerance quadrature problem +- forward solver: bdf-dense / bdf-iterated / adams - 1 / 2 / 3 +- backward solver: bdf-dense / bdf-iterated / adams - 1 / 2 / 3 +- checkpointing: number of checkpoints every X steps +- checkpointing: hermite or polynomial approximation - 1 / 2 + +During an experimental phase of this feature we can hopefully learn +which of these are relevant and which can be dropped for a final +release of the `ode_adjoint` function. + +**Leaving the template text here for now** + +Explain the proposal as if it was already included in the project and you were teaching it to another Stan programmer in the manual. That generally means: + +- Introducing new named concepts. +- Explaining the feature largely in terms of examples. +- Explaining how Stan programmers should *think* about the feature, and how it should impact the way they use the relevant package. It should explain the impact as concretely as possible. +- If applicable, provide sample error messages, deprecation warnings, or migration guidance. +- If applicable, describe the differences between teaching this to existing Stan programmers and new Stan programmers. + +For implementation-oriented RFCs (e.g. for compiler internals), this section should focus on how compiler contributors should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms. + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +Here I would like to document exactly the math of the approach using +the notation of the CVODES manual. + +TODO !!! + +This is the technical portion of the RFC. Explain the design in sufficient detail that: + +- Its interaction with other features is clear. +- It is reasonably clear how the feature would be implemented. +- Corner cases are dissected by example. + +The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work. + +# Drawbacks +[drawbacks]: #drawbacks + +It's some work to be done. Other than that there are no alternatives +to my knowledge to get large ODE systems working in Stan. What we are +missing out for now is to exploit the sparsity structure of the +ODE. This would allow for more efficient solvers and even larger +systems, but this is not possible at the moment to figure out +structurally the inter-dependencies of inputs and outputs. + +# TODO: Update the rest + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +- Why is this design the best in the space of possible designs? +- What other designs have been considered and what is the rationale for not choosing them? +- What is the impact of not doing this? + +# Prior art +[prior-art]: #prior-art + +Discuss prior art, both the good and the bad, in relation to this proposal. +A few examples of what this can include are: + +- For language, library, tools, and compiler proposals: Does this feature exist in other programming languages and what experience have their community had? +- For community proposals: Is this done by some other community and what were their experiences with it? +- For other teams: What lessons can we learn from what other communities have done here? +- Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background. + +This section is intended to encourage you as an author to think about the lessons from other languages, provide readers of your RFC with a fuller picture. +If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if it is an adaptation from other languages. + +Note that while precedent set by other languages is some motivation, it does not on its own motivate an RFC. +Please also take into consideration that rust sometimes intentionally diverges from common language features. + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +- What parts of the design do you expect to resolve through the RFC process before this gets merged? +- What parts of the design do you expect to resolve through the implementation of this feature before stabilization? +- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC? From 1855347c9a50ca89b858366fe8892a39d1b1855c Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Tue, 16 Feb 2021 22:27:34 +0100 Subject: [PATCH 02/14] finish analytical stuff --- designs/0027-adjoint-ode.md | 183 ++++++++++++++++++++++++++++-------- 1 file changed, 142 insertions(+), 41 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 4cc79f2..9f1b7bd 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -14,12 +14,12 @@ implement in Stan the so-called forward-mode ODE method in order to obtain in addition to the solution of the ODE the sensitivities of the ODE. The cost of this method scales as -$$ N * M . $$ +$$ N \cdot M . $$ The computational cost of the adjoint method is much more favorable in comparison which is -$$ 2 * N + M .$$ +$$ 2 \cdot N + M .$$ The advantage of the adjoint methods shows in particular whenever the number of parameters are relatively large in comparison to the number @@ -38,10 +38,10 @@ adjoint ODE solving method Stan will be able to scale to much larger problems involving more states and more parameters. Examples are large pharmacokinetic / pharmacodynamic systems or physiological based pharmacokinetic models or bigger susceptible infectious & removed -models. As the adjoint ODE method will grow the computational cost -only linearly with states and parameters, the modelers can more freely -increase the complexity of the ODE model without having to worry too -much about infeasible inference times. +models. The adjoint ODE method will grow the computational cost only +linearly with states and parameters and therefore the modelers can +more freely increase the complexity of the ODE model without having to +worry too much about infeasible inference times. # Guide-level explanation [guide-level-explanation]: #guide-level-explanation @@ -54,12 +54,13 @@ package which we already include in Stan. From a more internal stan math perspective, the key difference to the existing forward mode solvers will be that the gradients are not any -more precomputed. Instead, during the forward sweep of reverse mode -autodiff, the ODE is solved first for the N states (without any -sensitivities). When reverse mode AD then performs the backward sweep, -the adjoints of the input operands are directly computed given the -input adjoints. This involves a backward solve in time with N states -and solving M quadrature problems in addition. +more precomputed during the forward sweep of reverse mode autodiff +(AD). Instead, during the forward sweep of reverse mode autodiff, the +ODE is solved only for the N states (without any sensitivities). When +reverse mode AD then performs the backward sweep, the adjoints of the +input operands are directly computed given the input adjoints. This +involves a backward solve in time with N states and solving $M$ +quadrature problems in addition. The numerical complexity is higher for the adjoint method in comparison to the forward method. While most of the complexity is @@ -84,33 +85,124 @@ During an experimental phase of this feature we can hopefully learn which of these are relevant and which can be dropped for a final release of the `ode_adjoint` function. -**Leaving the template text here for now** - -Explain the proposal as if it was already included in the project and you were teaching it to another Stan programmer in the manual. That generally means: - -- Introducing new named concepts. -- Explaining the feature largely in terms of examples. -- Explaining how Stan programmers should *think* about the feature, and how it should impact the way they use the relevant package. It should explain the impact as concretely as possible. -- If applicable, provide sample error messages, deprecation warnings, or migration guidance. -- If applicable, describe the differences between teaching this to existing Stan programmers and new Stan programmers. - -For implementation-oriented RFCs (e.g. for compiler internals), this section should focus on how compiler contributors should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms. - # Reference-level explanation [reference-level-explanation]: #reference-level-explanation -Here I would like to document exactly the math of the approach using -the notation of the CVODES manual. - -TODO !!! - -This is the technical portion of the RFC. Explain the design in sufficient detail that: - -- Its interaction with other features is clear. -- It is reasonably clear how the feature would be implemented. -- Corner cases are dissected by example. - -The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work. +The adjoint method for ODEs is relatively involved mathematically. The +main challenge comes from a mix of different notation and conventions +used in different fields. Here we relate commonly used notation within +Stan math to the [user guide of +CVODES](https://computing.llnl.gov/sites/default/files/public/cvs_guide-5.6.1.pdf), +the underlying library we employ to handle the numerical +solution. The goal of reverse mode autodiff is to calculate the +derivative of some function $l(\theta)$ wrt. to it's parameters +$\theta$ at some particular realization of $\theta$. The function $l$ +now depends on the solution of an ODE given as initial value problem + +\begin{align} +\dot{y} &= f(y,t,p) (\#eq:odeDef) \\ + y(t_0) &= y^{(0)}. (\#eq:odeInitial) +\end{align} + +The ODE has $N$ states, $y=(y_1, \ldots, y_N)$, and is +parametrized in terms of parameters $p$ which are a subset of +$\theta$. Let's now further assume for simplicity that the function +$l$ only depends on the solution of the ODE at the time-point +$T$. During the reverse sweep of reverse mode autodiff we are then +given the adjoints of $a_{l,y_n}$ wrt to the output state $y(T,p)$ which +are the partials of $l$ wrt to each state + +$$ a_{l,y_n} = \left.\frac{\partial l}{\partial y_n}\right|_{t=T} $$ + +and we must calculate the contribution to the adjoint of the parameters + +$$ a_{l,p_m} = \sum_{n=1}^{N} \left. a_{l,y_n} \, \frac{\partial y_n}{\partial p_m}\right|_{t=T}, $$ + +which involves for each parameter $p_m$ the partial of every state +$y_n$ wrt to the parameter $p_m$. Note that the computational benefit of the +adjoint method stems from calculating the above sums *directly* in +contrast to the forward method which calculates every partial +$\left. \frac{\partial y_n}{\partial p_m}\right|_{t=T}$ separatley. + +In the notation of the CVODES user manual, the function $g(t,y,p)$ +**is equal to** $l(\theta)$ essentially. Through the use of Lagrange +multipliers, the adjoint problem is transferred to a backward problem +in equation 2.20 of the CVODES manual ([see here for a step-by-step +derivation](https://arxiv.org/abs/2002.00326)), + +\begin{align} +\dot{\lambda} &= - \left(\frac{\partial f}{\partial y}\right)^* +\lambda - \left(\frac{\partial g}{\partial y}\right)^* (\#eq:lambdaOde) \\ + \lambda(T) &= 0. (\#eq:lambdaInitial) +\end{align} + +This ODE is referred to as the *backward* problem, since we fix the +solution $\lambda$ at the final end-point $T$ instead of the initial +condition at $t_0$ (it's a terminal value problem rather than an +initial value problem). The CVODES manual then proceeds with deriving +that + +\begin{equation} +\left.\frac{d g}{d p}\right|_{t=T} = \mu^*(t_0) \, s(t_0) + +\left.\frac{\partial g}{\partial p}\right|_{t=T} + \int_{t_0}^{T} +\mu^* \, \frac{\partial f}{\partial p} \, dt. +(\#eq:derivg) +\end{equation} + +Here $s(t_0)$ is the state sensitivity wrt to the parameters at the +initial time-point. The term $\left.\frac{\partial g}{\partial +p}\right|_{t=T}$ is the partial derivative wrt to the parameters of +$g(t,y,p)$. This term is determined by the terms in $g(t,y,p)$ which directly +depend on the parameters and not implicitly due to the parameters +affecting the ODE state. Thus, this term is being computed by the +autodiff system of stan-math itself as part of the adjoints of the +parameters, $a_{l,p_m}$. Finally, $\mu = \frac{d \lambda}{d T}$ such +that $\mu$ is obtained by the equivalent backward problem (taking the +total derivative of $\dot{\lambda}$ in \@ref(eq:lambdaOde) wrt to $T$) + +\begin{align} +\dot{\mu} &= - \left(\frac{\partial f}{\partial y}\right)^* \, \mu +(\#eq:muODE) \\ +\mu(T) &= \left(\frac{\partial g}{\partial y}\right)^*_{t=T}. (\#eq:muInitial) +\end{align} + +If we now recall that $g(t,y,p)$ is equal to the function $l(\theta)$ +being subject to autodiff we see that + +$$\left(\frac{\partial g}{\partial y_n}\right)^*_{t=T} = a_{l,y_n}^*, $$ + +which is the input we get during reverse mode autodiff (the conjugate +transpose does not apply for Stan math using reals only). + +It is important to note that the backward ODE problem requires as +input the adjoint $a_{l,y_n}^*$ such that the backward problem must be +solved during the reverse pass of reverse mode autodiff. Thus, CVODES +is used during the forward pass to *solve (forward)* the ODE in +\@ref(eq:odeDef). Once the backward pass is initiated, the adjoints +$a_{l,y_n}^*$ are used as the terminal value in order to solve the +*backward problem* of \@ref(eq:muODE). Whenever any parameter +sensitivities are requested, then we must solve along the backward +problem an additional quadrature problem, which is the integrand in +equation \@ref(eq:derivg). + +TODO: outline generalization to multiple time-points. + +In total we need to solve 3 integration problems which consist of a +forward ODE problem, a backward ODE problem and a backward quadrature +problem. The forward ODE and the backward ODE problem can be solved +with either a non-stiff Adams or a stiff BDF solver of CVODES (the +choice is independent for each problem). For the Newton steps of the +BDF routine a linear solver routine is required. The routine can +either be a dense solver or an approximate iterative solver. As we +target large ODE systems it can be useful to allow users to choose the +solver being used. For example, whenever the number of ODE states is +very large, then an iterative solver may be preferable. The reason is +that the dense solver will require the calculation of the full +Jacobian $\frac{df}{dy}$ of the ODE RHS in \@ref(eq:odeDef) wrt to the +states $y$. In contrast, the iterative solver only needs to evaluate +Jacobian-vector products $\frac{df}{dy}\,v$, which we can compute +directly using *forward mode*. In addition all 3 integration problems +have their own relative and absolute tolerance targets. # Drawbacks [drawbacks]: #drawbacks @@ -122,14 +214,23 @@ ODE. This would allow for more efficient solvers and even larger systems, but this is not possible at the moment to figure out structurally the inter-dependencies of inputs and outputs. -# TODO: Update the rest - # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives -- Why is this design the best in the space of possible designs? -- What other designs have been considered and what is the rationale for not choosing them? -- What is the impact of not doing this? +There is no other analytical technique available which scales in a +comparable way. Hence, larger ODE problems ( many parameters and/or +states) are currently out of scope for Stan and the adjoint technique +does enable Stan to cope with these larger systems. + +The numerical complexities are rather involved as 3 nested +integrations are performed. This makes things somewhat fragile and +less robust. What makes the backward integration in +particular involved is that the solution of the forward problem must +be stored as a continuous function in memory and hence an +interpolation of the forward solve is required. This is provided by +CVODES via a checkpointing procedure. In summary, we do heavily rely +on CVODES infrastructure as these building blocks are rather complex +and heavy to craft on our own. # Prior art [prior-art]: #prior-art From 4b11957fa9b059b316b6e80b65d64c9ed4e82648 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Wed, 17 Feb 2021 21:30:39 +0100 Subject: [PATCH 03/14] add missing bits --- designs/0027-adjoint-ode.md | 92 +++++++++++++++++++++++++++++++------ 1 file changed, 77 insertions(+), 15 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 9f1b7bd..0a38260 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -185,7 +185,11 @@ sensitivities are requested, then we must solve along the backward problem an additional quadrature problem, which is the integrand in equation \@ref(eq:derivg). -TODO: outline generalization to multiple time-points. +So far we have only considered the case of a single time-point +$T$. For multiple time-points one merely starts the backward +integration from the last time-point and keeps accumulating the +respective terms which result from integrating backwards in steps +until $t_0$ is reached. In total we need to solve 3 integration problems which consist of a forward ODE problem, a backward ODE problem and a backward quadrature @@ -204,6 +208,51 @@ Jacobian-vector products $\frac{df}{dy}\,v$, which we can compute directly using *forward mode*. In addition all 3 integration problems have their own relative and absolute tolerance targets. +The propsed function should has the following signature: + +```stan +vector[] ode_adjoint(F f, + vector y0, + real t0, real[] times, + real rel_tol_f, vector abs_tol_f, + real rel_tol_b, real abs_tol_b, + real rel_tol_q, real abs_tol_q, + int max_num_steps, + int num_checkpoints, + int interpolation_polynomial, + int solver_f, int solver_b + T1 arg1, T2 arg2, ...) +``` + +The arguments are: +1. ```f``` - User-defined right hand side of the ODE (`dy/dt = f`) +2. ```y0``` - Initial state of the ode solve (`y0 = y(t0)`) +3. ```t0``` - Initial time of the ode solve +4. ```times``` - Sorted arary of times to which the ode will be solved (each + element must be greater than t0) +5. ```rel_tol_f``` - Relative tolerance for forward solve (data) +6. ```abs_tol_f``` - Absolute tolerance vector for each state for + forward solve (data) +7. ```rel_tol_b``` - Relative tolerance for backward solve (data) +8. ```abs_tol_b``` - Absolute tolerance for backward solve (data) +9. ```rel_tol_q``` - Relative tolerance for backward quadrature (data) +10. ```abs_tol_q``` - Absolute tolerance for backward quadrature (data) +11. ```max_num_steps``` - Maximum number of timesteps to take in integrating + the ODE solution between output time points for forward and backward + solve (data) +12. ```num_checkpoints``` number of steps between checkpointing forward + solution (data) +13. ```interpolation_polynomial``` can be 1 for hermite or 2 for + polynomial interpolation method of CVODES +14. ```solver_f``` solver used for forward ODE problem: 1=Adams, + 2=bdf, 3=bdf_iterated +15. ```solver_b``` solver used for backward ODE problem: 1=Adams, + 2=bdf, 3=bdf_iterated +16. ```arg1, arg2, ...``` - Arguments passed unmodified to the ODE right +hand side. The types ```T1, T2, ...``` can be any type, but they must match +the types of the matching arguments of ```f```. + + # Drawbacks [drawbacks]: #drawbacks @@ -212,7 +261,10 @@ to my knowledge to get large ODE systems working in Stan. What we are missing out for now is to exploit the sparsity structure of the ODE. This would allow for more efficient solvers and even larger systems, but this is not possible at the moment to figure out -structurally the inter-dependencies of inputs and outputs. +structurally the inter-dependencies of inputs and outputs. However, +the matrix free bdf method may already take advantage sufficiently +enough of the sparsity, since in this case we avoid evaluation of the +full Jacobian to some extent. # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives @@ -235,23 +287,33 @@ and heavy to craft on our own. # Prior art [prior-art]: #prior-art +The adjoint sensitivity method is not very widley used. It's somewhat +present in engineering literature and found it's way into packages +like `DifferentialEquations.jl` in Julia. Another noticeable reference +is +https://www.mcs.anl.gov/~hongzh/publication/zhang-2014/SISC_FATODE_final.pdf +and a discourse post from Ben +https://discourse.mc-stan.org/t/on-adjoint-sensitivity-of-ode-pde/5148/16 +. + +Another domain where the adjoint sensitivity method is beind used is +in systems biology. The (AMICI)[https://github.com/AMICI-dev/AMICI] +toolkit has been build to solve ODE system for large scale systems +biology equation systems to be used within optimizer software. The +adjoint method is implemented in AMICI using CVODES and has been +benchmarked on various problems from systems biology as +(published)[https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005331] +(see also (this arxiv pre-print)[https://arxiv.org/abs/2012.09122]). + Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are: -- For language, library, tools, and compiler proposals: Does this feature exist in other programming languages and what experience have their community had? -- For community proposals: Is this done by some other community and what were their experiences with it? -- For other teams: What lessons can we learn from what other communities have done here? -- Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background. - -This section is intended to encourage you as an author to think about the lessons from other languages, provide readers of your RFC with a fuller picture. -If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if it is an adaptation from other languages. - -Note that while precedent set by other languages is some motivation, it does not on its own motivate an RFC. -Please also take into consideration that rust sometimes intentionally diverges from common language features. # Unresolved questions [unresolved-questions]: #unresolved-questions -- What parts of the design do you expect to resolve through the RFC process before this gets merged? -- What parts of the design do you expect to resolve through the implementation of this feature before stabilization? -- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC? +The main unresolved question at this point is to settle on the set of +exposed tuning parameters. Currently the proposal is to expose in an +experimental version a large super-set of tuning parameters and ask +the Stan community to experiment with it in order to weed out some +tuning parameters if possible. From 10832abab62941aa58c7a7fa62fecd6d44ec2386 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Thu, 18 Feb 2021 11:39:01 +0100 Subject: [PATCH 04/14] fix links --- designs/0027-adjoint-ode.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 0a38260..1017002 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -290,20 +290,17 @@ and heavy to craft on our own. The adjoint sensitivity method is not very widley used. It's somewhat present in engineering literature and found it's way into packages like `DifferentialEquations.jl` in Julia. Another noticeable reference -is -https://www.mcs.anl.gov/~hongzh/publication/zhang-2014/SISC_FATODE_final.pdf -and a discourse post from Ben -https://discourse.mc-stan.org/t/on-adjoint-sensitivity-of-ode-pde/5148/16 -. +is [FATODE](https://www.mcs.anl.gov/~hongzh/publication/zhang-2014/SISC_FATODE_final.pdf) +and a [discourse post from Ben](https://discourse.mc-stan.org/t/on-adjoint-sensitivity-of-ode-pde/5148/16). Another domain where the adjoint sensitivity method is beind used is -in systems biology. The (AMICI)[https://github.com/AMICI-dev/AMICI] +in systems biology. The [AMICI](https://github.com/AMICI-dev/AMICI) toolkit has been build to solve ODE system for large scale systems biology equation systems to be used within optimizer software. The adjoint method is implemented in AMICI using CVODES and has been benchmarked on various problems from systems biology as -(published)[https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005331] -(see also (this arxiv pre-print)[https://arxiv.org/abs/2012.09122]). +[published](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005331) +(see also [this arxiv pre-print](https://arxiv.org/abs/2012.09122)). Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are: From 92e173efc04aaa3d3dae277dcc7b8681f1008543 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Mon, 1 Mar 2021 22:10:08 +0100 Subject: [PATCH 05/14] refine function calls --- designs/0027-adjoint-ode.md | 41 +++++++++++++++++++++++++++++++++---- 1 file changed, 37 insertions(+), 4 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 1017002..5bd74dc 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -208,14 +208,16 @@ Jacobian-vector products $\frac{df}{dy}\,v$, which we can compute directly using *forward mode*. In addition all 3 integration problems have their own relative and absolute tolerance targets. -The propsed function should has the following signature: + + +The proposed function should has the following signature: ```stan -vector[] ode_adjoint(F f, +vector[] ode_adjoint_tol_ctl(F f, vector y0, real t0, real[] times, real rel_tol_f, vector abs_tol_f, - real rel_tol_b, real abs_tol_b, + real rel_tol_b, vector abs_tol_b, real rel_tol_q, real abs_tol_q, int max_num_steps, int num_checkpoints, @@ -234,7 +236,7 @@ The arguments are: 6. ```abs_tol_f``` - Absolute tolerance vector for each state for forward solve (data) 7. ```rel_tol_b``` - Relative tolerance for backward solve (data) -8. ```abs_tol_b``` - Absolute tolerance for backward solve (data) +8. ```abs_tol_b``` - Absolute tolerance vector for each state backward solve (data) 9. ```rel_tol_q``` - Relative tolerance for backward quadrature (data) 10. ```abs_tol_q``` - Absolute tolerance for backward quadrature (data) 11. ```max_num_steps``` - Maximum number of timesteps to take in integrating @@ -252,6 +254,37 @@ The arguments are: hand side. The types ```T1, T2, ...``` can be any type, but they must match the types of the matching arguments of ```f```. +In addition a function version is made available which mimics the +existing ode interface for tolerances. This version will allow users +to quickly try out the new adjoint interface. **The extra control +parameters will be set based on best guesses as of now and are subject +to change**: + + +```stan +vector[] ode_adjoint_tol(F f, + vector y0, + real t0, real[] times, + real rel_tol, real abs_tol, + int max_num_steps, + T1 arg1, T2 arg2, ...) +``` + +The arguments are: +1. ```f``` - User-defined right hand side of the ODE (`dy/dt = f`) +2. ```y0``` - Initial state of the ode solve (`y0 = y(t0)`) +3. ```t0``` - Initial time of the ode solve +4. ```times``` - Sorted arary of times to which the ode will be solved (each + element must be greater than t0) +5. ```rel_tol``` - Relative tolerance for all solves (data) +6. ```abs_tol``` - Absolute tolerance for all solves (data) +7. ```max_num_steps``` - Maximum number of timesteps to take in integrating + the ODE solution between output time points for forward and backward + solve (data) +8. ```arg1, arg2, ...``` - Arguments passed unmodified to the ODE right +hand side. The types ```T1, T2, ...``` can be any type, but they must match +the types of the matching arguments of ```f```. + # Drawbacks [drawbacks]: #drawbacks From 59c96a5b0720cb41b5485fc77ef843275625076c Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Sat, 6 Mar 2021 23:25:21 +0100 Subject: [PATCH 06/14] address Mikes comments part 1 --- designs/0027-adjoint-ode.md | 50 ++++++++++++++++++++++--------------- 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 5bd74dc..40e19a9 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -14,10 +14,10 @@ implement in Stan the so-called forward-mode ODE method in order to obtain in addition to the solution of the ODE the sensitivities of the ODE. The cost of this method scales as -$$ N \cdot M . $$ +$$ N \cdot (M + 1). $$ -The computational cost of the adjoint method is much more favorable -in comparison which is +The computational cost of the adjoint method scales much more +favorable in comparison which is $$ 2 \cdot N + M .$$ @@ -25,8 +25,12 @@ The advantage of the adjoint methods shows in particular whenever the number of parameters are relatively large in comparison to the number of states. Most importantly, the computational cost grows only linearly in the number of states and parameters (while forward grows -exponentially). Thus, this method can scale to much larger problems -without exponentially increasing computational cost. +multiplicatively). Thus, this method can scale to much larger problems +without a multiplicative increasing computational cost. However, the +adjoint method is more involved resulting in a bigger overhead than +the forward mode method. As a result, the forward method can +have advantages for smaller ODE systems and as such both methods +remain valuable within Stan. # Motivation [motivation]: #motivation @@ -302,14 +306,16 @@ full Jacobian to some extent. # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives -There is no other analytical technique available which scales in a -comparable way. Hence, larger ODE problems ( many parameters and/or -states) are currently out of scope for Stan and the adjoint technique -does enable Stan to cope with these larger systems. +There is no other analytical ODE solving technique available which scales in a +comparable way. The better scalability of the adjoint method enables +at a fixed computational ressource the possibility to solve larger ODE +systems with many parameters and/or states to be solved faster. Thus, +a Stan modeler can increase the complexity of an ODE model under study +without subtantially increasing model runtimes. -The numerical complexities are rather involved as 3 nested -integrations are performed. This makes things somewhat fragile and -less robust. What makes the backward integration in +The numerical complexities of the adjoint method are rather involved +as 3 nested integrations are performed. This makes things somewhat +fragile and less robust. What makes the backward integration in particular involved is that the solution of the forward problem must be stored as a continuous function in memory and hence an interpolation of the forward solve is required. This is provided by @@ -320,11 +326,13 @@ and heavy to craft on our own. # Prior art [prior-art]: #prior-art -The adjoint sensitivity method is not very widley used. It's somewhat -present in engineering literature and found it's way into packages -like `DifferentialEquations.jl` in Julia. Another noticeable reference -is [FATODE](https://www.mcs.anl.gov/~hongzh/publication/zhang-2014/SISC_FATODE_final.pdf) -and a [discourse post from Ben](https://discourse.mc-stan.org/t/on-adjoint-sensitivity-of-ode-pde/5148/16). +The adjoint sensitivity method is applied in within different domains +of science. For example the method is present in engineering +literature and found it's way into packages like +`DifferentialEquations.jl` in Julia. Another noticeable reference is +[FATODE](https://www.mcs.anl.gov/~hongzh/publication/zhang-2014/SISC_FATODE_final.pdf) +and a [discourse post from +Ben](https://discourse.mc-stan.org/t/on-adjoint-sensitivity-of-ode-pde/5148/16). Another domain where the adjoint sensitivity method is beind used is in systems biology. The [AMICI](https://github.com/AMICI-dev/AMICI) @@ -335,9 +343,11 @@ benchmarked on various problems from systems biology as [published](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005331) (see also [this arxiv pre-print](https://arxiv.org/abs/2012.09122)). -Discuss prior art, both the good and the bad, in relation to this proposal. -A few examples of what this can include are: - +However, so far the adjoint method is commonly build into special +purpose software as the method requires a fixed target +functional. The integration of the adjoint ODE sensitivity method into +a general puropse autodiff library as Stan math is presumably the +first implementation of its kind. # Unresolved questions [unresolved-questions]: #unresolved-questions From 1b94e37aed95120fc2799d611b234941915e3ee4 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Sun, 7 Mar 2021 16:42:39 +0100 Subject: [PATCH 07/14] adress remaining sections --- designs/0027-adjoint-ode.md | 126 +++++++++++++++++++++++------------- 1 file changed, 80 insertions(+), 46 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 40e19a9..3c04dc7 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -61,12 +61,17 @@ existing forward mode solvers will be that the gradients are not any more precomputed during the forward sweep of reverse mode autodiff (AD). Instead, during the forward sweep of reverse mode autodiff, the ODE is solved only for the N states (without any sensitivities). When -reverse mode AD then performs the backward sweep, the adjoints of the -input operands are directly computed given the input adjoints. This -involves a backward solve in time with N states and solving $M$ -quadrature problems in addition. - -The numerical complexity is higher for the adjoint method in +reverse mode AD then performs the backward sweep, the autodiff +adjoints of the parameters of the ODE (input operands to the ODE +function call) are directly computed given the parent autodiff +adjoints as defined by the autodiff call graph. This calculation +involves a backward solve of an adjoint ODE system with $N$ so-called +ODE adjoint states as defined by the adjoint method and explained +below. Along the backward integration of the ODE adjoint problem one +has in addition to solve $M$ one-dimensional quadrature problems (one +for each parameter of the ODE). + +The numerical complexity is higher for the ODE adjoint method in comparison to the forward method. While most of the complexity is handled by CVODES, the numerical procedure seems to require more knowledge about the tuning parameters. At least at the moment it @@ -78,21 +83,31 @@ as proposed are: - absolute & relative tolerance forward solve; absolute tolerance should be a vector of length N, the number of states -- absolute & relative tolerance backward solve +- absolute & relative tolerance backward solve (for consistency the + absolute tolerances are given as vector) - absolute & relative tolerance quadrature problem -- forward solver: bdf-dense / bdf-iterated / adams - 1 / 2 / 3 -- backward solver: bdf-dense / bdf-iterated / adams - 1 / 2 / 3 +- maximal number of steps between time-points; same for forward and + backward integration +- forward solver: bdf / adams - 1 / 2 +- backward solver: bdf / adams - 1 / 2 - checkpointing: number of checkpoints every X steps - checkpointing: hermite or polynomial approximation - 1 / 2 During an experimental phase of this feature we can hopefully learn -which of these are relevant and which can be dropped for a final -release of the `ode_adjoint` function. +which of these are relevant. In order to make it easy for users to +switch their existing ODE problems to the new solver an +`ode_adjoint_tol` function will be provided during the experimental +phase. This function call follows the same conventions as the existing +ode integrators. The tuning parameters set as default will be +described below. The simplified call will be integrated into the main +stan math library if users find this function helpful and a reasonable +default for the tuning parameters can be found during the experimental +phase. # Reference-level explanation [reference-level-explanation]: #reference-level-explanation -The adjoint method for ODEs is relatively involved mathematically. The +The ODE adjoint method for ODEs is relatively involved mathematically. The main challenge comes from a mix of different notation and conventions used in different fields. Here we relate commonly used notation within Stan math to the [user guide of @@ -100,7 +115,8 @@ CVODES](https://computing.llnl.gov/sites/default/files/public/cvs_guide-5.6.1.pd the underlying library we employ to handle the numerical solution. The goal of reverse mode autodiff is to calculate the derivative of some function $l(\theta)$ wrt. to it's parameters -$\theta$ at some particular realization of $\theta$. The function $l$ +$\theta$ at some particular realization of $\theta$; here $\theta$ +denotes a set of parameters. The function $l$ now depends on the solution of an ODE given as initial value problem \begin{align} @@ -113,24 +129,29 @@ parametrized in terms of parameters $p$ which are a subset of $\theta$. Let's now further assume for simplicity that the function $l$ only depends on the solution of the ODE at the time-point $T$. During the reverse sweep of reverse mode autodiff we are then -given the adjoints of $a_{l,y_n}$ wrt to the output state $y(T,p)$ which +given the autodiff adjoints of $a_{l,y_n}$ wrt to the output state +$y(T,p)$. These are the partials of $l$ wrt to each state $$ a_{l,y_n} = \left.\frac{\partial l}{\partial y_n}\right|_{t=T} $$ -and we must calculate the contribution to the adjoint of the parameters +and we must calculate the contribution to the autodiff adjoint of the +parameters $$ a_{l,p_m} = \sum_{n=1}^{N} \left. a_{l,y_n} \, \frac{\partial y_n}{\partial p_m}\right|_{t=T}, $$ which involves for each parameter $p_m$ the partial of every state -$y_n$ wrt to the parameter $p_m$. Note that the computational benefit of the -adjoint method stems from calculating the above sums *directly* in -contrast to the forward method which calculates every partial -$\left. \frac{\partial y_n}{\partial p_m}\right|_{t=T}$ separatley. +$y_n$ wrt to the parameter $p_m$. Note that the computational benefit +of the ODE adjoint method stems from calculating the above sums +*directly* in contrast to the forward method which calculates every +partial $\left. \frac{\partial y_n}{\partial p_m}\right|_{t=T}$ +separatley. In fact, the above sum is equal to a product of a +row-vector of autodiff adjoints with the transpose of the Jacobian of +the ODE solution wrt to the parameters at time $T$. In the notation of the CVODES user manual, the function $g(t,y,p)$ **is equal to** $l(\theta)$ essentially. Through the use of Lagrange -multipliers, the adjoint problem is transferred to a backward problem +multipliers, the ODE adjoint problem is transferred to a backward problem in equation 2.20 of the CVODES manual ([see here for a step-by-step derivation](https://arxiv.org/abs/2002.00326)), @@ -175,19 +196,20 @@ being subject to autodiff we see that $$\left(\frac{\partial g}{\partial y_n}\right)^*_{t=T} = a_{l,y_n}^*, $$ -which is the input we get during reverse mode autodiff (the conjugate -transpose does not apply for Stan math using reals only). +which is the input we get during reverse mode autodiff (the conjugation +does not apply for Stan math using reals only). It is important to note that the backward ODE problem requires as -input the adjoint $a_{l,y_n}^*$ such that the backward problem must be -solved during the reverse pass of reverse mode autodiff. Thus, CVODES -is used during the forward pass to *solve (forward)* the ODE in -\@ref(eq:odeDef). Once the backward pass is initiated, the adjoints -$a_{l,y_n}^*$ are used as the terminal value in order to solve the -*backward problem* of \@ref(eq:muODE). Whenever any parameter -sensitivities are requested, then we must solve along the backward -problem an additional quadrature problem, which is the integrand in -equation \@ref(eq:derivg). +input the autodiff adjoint $a_{l,y_n}^*$ such that the backward +problem must be solved during the reverse pass of reverse mode +autodiff. Thus, CVODES is used during the forward pass to *solve +(forward)* the ODE in \@ref(eq:odeDef). Once the backward pass is +initiated, the autodiff adjoints $a_{l,y_n}^*$ are used as the +terminal value in order to solve the *backward problem* of +\@ref(eq:muODE). Whenever any parameter sensitivities are requested, +then we must solve along the backward problem an additional +one-dimensional quadrature problem per parameter of the ODE, which is +the integrand in equation \@ref(eq:derivg). So far we have only considered the case of a single time-point $T$. For multiple time-points one merely starts the backward @@ -200,18 +222,12 @@ forward ODE problem, a backward ODE problem and a backward quadrature problem. The forward ODE and the backward ODE problem can be solved with either a non-stiff Adams or a stiff BDF solver of CVODES (the choice is independent for each problem). For the Newton steps of the -BDF routine a linear solver routine is required. The routine can -either be a dense solver or an approximate iterative solver. As we -target large ODE systems it can be useful to allow users to choose the -solver being used. For example, whenever the number of ODE states is -very large, then an iterative solver may be preferable. The reason is -that the dense solver will require the calculation of the full -Jacobian $\frac{df}{dy}$ of the ODE RHS in \@ref(eq:odeDef) wrt to the -states $y$. In contrast, the iterative solver only needs to evaluate -Jacobian-vector products $\frac{df}{dy}\,v$, which we can compute -directly using *forward mode*. In addition all 3 integration problems -have their own relative and absolute tolerance targets. - +BDF routine a linear solver routine is required for which a dense +solver is used, but sparse solvers could be explored eventually +(initial prototypes with a sparse solvers did not show any benefits, +but further work is needed to fully assess this). In addition all 3 +integration problems have their own relative and absolute tolerance +targets. The proposed function should has the following signature: @@ -289,6 +305,16 @@ The arguments are: hand side. The types ```T1, T2, ...``` can be any type, but they must match the types of the matching arguments of ```f```. +The buest guesses based on preliminary experiments for the remaining +tuning parameters are then set as + +- same relative tolerance in all problems +- `abs_tol_f = abs_tol / 100` +- `abs_tol_b = abs_tol / 10` +- `abs_tol_q = abs_tol` +- $250$ steps between checkpoints +- bdf solver for forward and backward solve +- polynomial method for forward function approximation # Drawbacks [drawbacks]: #drawbacks @@ -354,6 +380,14 @@ first implementation of its kind. The main unresolved question at this point is to settle on the set of exposed tuning parameters. Currently the proposal is to expose in an -experimental version a large super-set of tuning parameters and ask -the Stan community to experiment with it in order to weed out some -tuning parameters if possible. +experimental version a large super-set of tuning parameters and a +version which resembles the enhanced interface with tolerances of the +existing ODE solvers. This allows users to quickly try out the adjoint +method by mereley changing a function name. The suggestion is to +collect feedback from the Stan community duing an experimental phase +of this feature. Once feedback on the utility of the simplifed +interface and the appropiatness of the set defaults are collected a +decision should be done about including this signature in a first +version included in the next release. Another goal of the experimental +phase is to explore the possiblity to weed out some tuing parameters +if possible. From 613984322c1960eb0a87091348f77acbc1ecce1d Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Sat, 20 Mar 2021 22:31:32 +0100 Subject: [PATCH 08/14] update --- designs/0027-adjoint-ode.md | 91 +++++++++++++++++-------------------- 1 file changed, 41 insertions(+), 50 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 3c04dc7..ba53364 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -6,10 +6,10 @@ # Summary [summary]: #summary -Ordinary differential equations (ODEs) are currently severly costly to +Ordinary differential equations (ODEs) are currently severely costly to fit in Stan for larger systems. This is due to the multiplicative -scaling of the required computing ressources with the ODE system size -N and the number of parameters M defining the ODE. Currently we +scaling of the required computing resources with the ODE system size +$N$ and the number of parameters $M$ defining the ODE. Currently we implement in Stan the so-called forward-mode ODE method in order to obtain in addition to the solution of the ODE the sensitivities of the ODE. The cost of this method scales as @@ -37,28 +37,28 @@ remain valuable within Stan. Stan is currently practically limited to solve problems with ODEs which are small in the number of states and parameters. If either of -them gets large, the computational cost explodes quickly. With the +them gets large, the computational cost explode quickly. With the adjoint ODE solving method Stan will be able to scale to much larger problems involving more states and more parameters. Examples are large pharmacokinetic / pharmacodynamic systems or physiological based pharmacokinetic models or bigger susceptible infectious & removed models. The adjoint ODE method will grow the computational cost only -linearly with states and parameters and therefore the modelers can -more freely increase the complexity of the ODE model without having to -worry too much about infeasible inference times. +linearly with states and parameters and therefore modelers can +more freely increase the complexity of the ODE model without +substantially increasing inference times. # Guide-level explanation [guide-level-explanation]: #guide-level-explanation The new solver `ode_adjoint` will follow in it's design the variadic ODE functions. As the new solver is numerically more involved (see -reference level) there are merely more tuning parameters needed for -the solver. The actual impementation will be provided by the CVODES +reference level) there are more tuning parameters needed for +the solver. The actual implementation will be provided by the CVODES package which we already include in Stan. From a more internal stan math perspective, the key difference to the existing forward mode solvers will be that the gradients are not any -more precomputed during the forward sweep of reverse mode autodiff +more pre-computed during the forward sweep of reverse mode autodiff (AD). Instead, during the forward sweep of reverse mode autodiff, the ODE is solved only for the N states (without any sensitivities). When reverse mode AD then performs the backward sweep, the autodiff @@ -82,14 +82,14 @@ available (if that is feasible at all). The tuning parameters exposed as proposed are: - absolute & relative tolerance forward solve; absolute tolerance - should be a vector of length N, the number of states + should be a vector of length $N$, the number of states - absolute & relative tolerance backward solve (for consistency the absolute tolerances are given as vector) - absolute & relative tolerance quadrature problem - maximal number of steps between time-points; same for forward and backward integration -- forward solver: bdf / adams - 1 / 2 -- backward solver: bdf / adams - 1 / 2 +- forward solver: adams / bdf - 1 / 2 +- backward solver: adams / bdf - 1 / 2 - checkpointing: number of checkpoints every X steps - checkpointing: hermite or polynomial approximation - 1 / 2 @@ -125,7 +125,7 @@ now depends on the solution of an ODE given as initial value problem \end{align} The ODE has $N$ states, $y=(y_1, \ldots, y_N)$, and is -parametrized in terms of parameters $p$ which are a subset of +parameterized in terms of parameters $p$ which are a subset of $\theta$. Let's now further assume for simplicity that the function $l$ only depends on the solution of the ODE at the time-point $T$. During the reverse sweep of reverse mode autodiff we are then @@ -145,7 +145,7 @@ $y_n$ wrt to the parameter $p_m$. Note that the computational benefit of the ODE adjoint method stems from calculating the above sums *directly* in contrast to the forward method which calculates every partial $\left. \frac{\partial y_n}{\partial p_m}\right|_{t=T}$ -separatley. In fact, the above sum is equal to a product of a +separately. In fact, the above sum is equal to a product of a row-vector of autodiff adjoints with the transpose of the Jacobian of the ODE solution wrt to the parameters at time $T$. @@ -212,7 +212,7 @@ one-dimensional quadrature problem per parameter of the ODE, which is the integrand in equation \@ref(eq:derivg). So far we have only considered the case of a single time-point -$T$. For multiple time-points one merely starts the backward +$T$. For multiple time-points one starts the backward integration from the last time-point and keeps accumulating the respective terms which result from integrating backwards in steps until $t_0$ is reached. @@ -221,16 +221,10 @@ In total we need to solve 3 integration problems which consist of a forward ODE problem, a backward ODE problem and a backward quadrature problem. The forward ODE and the backward ODE problem can be solved with either a non-stiff Adams or a stiff BDF solver of CVODES (the -choice is independent for each problem). For the Newton steps of the -BDF routine a linear solver routine is required for which a dense -solver is used, but sparse solvers could be explored eventually -(initial prototypes with a sparse solvers did not show any benefits, -but further work is needed to fully assess this). In addition all 3 -integration problems have their own relative and absolute tolerance -targets. +choice is independent for each problem). Each of the 3 integration +problems have their own relative and absolute tolerance targets. - -The proposed function should has the following signature: +The proposed function should have the following signature: ```stan vector[] ode_adjoint_tol_ctl(F f, @@ -240,7 +234,7 @@ vector[] ode_adjoint_tol_ctl(F f, real rel_tol_b, vector abs_tol_b, real rel_tol_q, real abs_tol_q, int max_num_steps, - int num_checkpoints, + int num_steps_between_checkpoints, int interpolation_polynomial, int solver_f, int solver_b T1 arg1, T2 arg2, ...) @@ -259,17 +253,17 @@ The arguments are: 8. ```abs_tol_b``` - Absolute tolerance vector for each state backward solve (data) 9. ```rel_tol_q``` - Relative tolerance for backward quadrature (data) 10. ```abs_tol_q``` - Absolute tolerance for backward quadrature (data) -11. ```max_num_steps``` - Maximum number of timesteps to take in integrating +11. ```max_num_steps``` - Maximum number of time-steps to take in integrating the ODE solution between output time points for forward and backward solve (data) -12. ```num_checkpoints``` number of steps between checkpointing forward +12. ```num_steps_between_checkpoints``` number of steps between checkpointing forward solution (data) 13. ```interpolation_polynomial``` can be 1 for hermite or 2 for polynomial interpolation method of CVODES 14. ```solver_f``` solver used for forward ODE problem: 1=Adams, - 2=bdf, 3=bdf_iterated + 2=bdf 15. ```solver_b``` solver used for backward ODE problem: 1=Adams, - 2=bdf, 3=bdf_iterated + 2=bdf 16. ```arg1, arg2, ...``` - Arguments passed unmodified to the ODE right hand side. The types ```T1, T2, ...``` can be any type, but they must match the types of the matching arguments of ```f```. @@ -294,18 +288,18 @@ The arguments are: 1. ```f``` - User-defined right hand side of the ODE (`dy/dt = f`) 2. ```y0``` - Initial state of the ode solve (`y0 = y(t0)`) 3. ```t0``` - Initial time of the ode solve -4. ```times``` - Sorted arary of times to which the ode will be solved (each +4. ```times``` - Sorted array of times to which the ode will be solved (each element must be greater than t0) 5. ```rel_tol``` - Relative tolerance for all solves (data) 6. ```abs_tol``` - Absolute tolerance for all solves (data) -7. ```max_num_steps``` - Maximum number of timesteps to take in integrating +7. ```max_num_steps``` - Maximum number of time-steps to take in integrating the ODE solution between output time points for forward and backward solve (data) 8. ```arg1, arg2, ...``` - Arguments passed unmodified to the ODE right hand side. The types ```T1, T2, ...``` can be any type, but they must match the types of the matching arguments of ```f```. -The buest guesses based on preliminary experiments for the remaining +The best guesses based on preliminary experiments for the remaining tuning parameters are then set as - same relative tolerance in all problems @@ -314,7 +308,7 @@ tuning parameters are then set as - `abs_tol_q = abs_tol` - $250$ steps between checkpoints - bdf solver for forward and backward solve -- polynomial method for forward function approximation +- hermite polynomial method for forward function approximation # Drawbacks [drawbacks]: #drawbacks @@ -324,20 +318,17 @@ to my knowledge to get large ODE systems working in Stan. What we are missing out for now is to exploit the sparsity structure of the ODE. This would allow for more efficient solvers and even larger systems, but this is not possible at the moment to figure out -structurally the inter-dependencies of inputs and outputs. However, -the matrix free bdf method may already take advantage sufficiently -enough of the sparsity, since in this case we avoid evaluation of the -full Jacobian to some extent. +structurally the inter-dependencies of inputs and outputs. # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives There is no other analytical ODE solving technique available which scales in a comparable way. The better scalability of the adjoint method enables -at a fixed computational ressource the possibility to solve larger ODE +at a fixed computational resource the possibility to solve larger ODE systems with many parameters and/or states to be solved faster. Thus, a Stan modeler can increase the complexity of an ODE model under study -without subtantially increasing model runtimes. +without substantially increasing model runtimes. The numerical complexities of the adjoint method are rather involved as 3 nested integrations are performed. This makes things somewhat @@ -352,15 +343,15 @@ and heavy to craft on our own. # Prior art [prior-art]: #prior-art -The adjoint sensitivity method is applied in within different domains -of science. For example the method is present in engineering +The adjoint sensitivity method is applied within different domains +of science. For example, the method is present in engineering literature and found it's way into packages like `DifferentialEquations.jl` in Julia. Another noticeable reference is [FATODE](https://www.mcs.anl.gov/~hongzh/publication/zhang-2014/SISC_FATODE_final.pdf) and a [discourse post from Ben](https://discourse.mc-stan.org/t/on-adjoint-sensitivity-of-ode-pde/5148/16). -Another domain where the adjoint sensitivity method is beind used is +Another domain where the adjoint sensitivity method is being used is in systems biology. The [AMICI](https://github.com/AMICI-dev/AMICI) toolkit has been build to solve ODE system for large scale systems biology equation systems to be used within optimizer software. The @@ -372,7 +363,7 @@ benchmarked on various problems from systems biology as However, so far the adjoint method is commonly build into special purpose software as the method requires a fixed target functional. The integration of the adjoint ODE sensitivity method into -a general puropse autodiff library as Stan math is presumably the +a general purpose autodiff library as Stan math is presumably the first implementation of its kind. # Unresolved questions @@ -383,11 +374,11 @@ exposed tuning parameters. Currently the proposal is to expose in an experimental version a large super-set of tuning parameters and a version which resembles the enhanced interface with tolerances of the existing ODE solvers. This allows users to quickly try out the adjoint -method by mereley changing a function name. The suggestion is to -collect feedback from the Stan community duing an experimental phase -of this feature. Once feedback on the utility of the simplifed -interface and the appropiatness of the set defaults are collected a +method by merely changing a function name. The suggestion is to +collect feedback from the Stan community during an experimental phase +of this feature. Once feedback on the utility of the simplified +interface and the appropriateness of the set defaults are collected a decision should be done about including this signature in a first version included in the next release. Another goal of the experimental -phase is to explore the possiblity to weed out some tuing parameters -if possible. +phase is to explore the possibility to weed out some tuning parameters +if possible or add others if needed. From 3185b30c38e41a4374ea656b6b0f2a49eb3faab0 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Mon, 22 Mar 2021 21:43:00 +0100 Subject: [PATCH 09/14] bump --- designs/0027-adjoint-ode.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index ba53364..16a5fd6 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -6,13 +6,13 @@ # Summary [summary]: #summary -Ordinary differential equations (ODEs) are currently severely costly to -fit in Stan for larger systems. This is due to the multiplicative +Ordinary differential equations (ODEs) are currently expensive to +evaluate in Stan for larger systems. This is due to the multiplicative scaling of the required computing resources with the ODE system size $N$ and the number of parameters $M$ defining the ODE. Currently we -implement in Stan the so-called forward-mode ODE method in order to -obtain in addition to the solution of the ODE the sensitivities of the -ODE. The cost of this method scales as +implement in Stan the forward-mode (direct) sensitivity ODE method in +order to obtain in addition to the solution of the ODE the +sensitivities of the ODE. The cost of this method scales as $$ N \cdot (M + 1). $$ @@ -41,9 +41,9 @@ them gets large, the computational cost explode quickly. With the adjoint ODE solving method Stan will be able to scale to much larger problems involving more states and more parameters. Examples are large pharmacokinetic / pharmacodynamic systems or physiological based -pharmacokinetic models or bigger susceptible infectious & removed -models. The adjoint ODE method will grow the computational cost only -linearly with states and parameters and therefore modelers can +pharmacokinetic models or bigger susceptible, infected, and recovered +(SIR) models. The adjoint ODE method will grow the computational cost +only linearly with states and parameters and therefore modelers can more freely increase the complexity of the ODE model without substantially increasing inference times. @@ -68,8 +68,8 @@ adjoints as defined by the autodiff call graph. This calculation involves a backward solve of an adjoint ODE system with $N$ so-called ODE adjoint states as defined by the adjoint method and explained below. Along the backward integration of the ODE adjoint problem one -has in addition to solve $M$ one-dimensional quadrature problems (one -for each parameter of the ODE). +has in addition to solve $M$ one-dimensional quadrature problems which +depend on the adjoint states (one problem for each parameter of the ODE). The numerical complexity is higher for the ODE adjoint method in comparison to the forward method. While most of the complexity is From e4756806ef63cd2bfa3b1eb8c6576c4d121fd732 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Tue, 23 Mar 2021 21:17:31 +0100 Subject: [PATCH 10/14] upd --- designs/0027-adjoint-ode.md | 64 ++++++++++++++++++++++--------------- 1 file changed, 38 insertions(+), 26 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 16a5fd6..3250663 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -107,31 +107,31 @@ phase. # Reference-level explanation [reference-level-explanation]: #reference-level-explanation -The ODE adjoint method for ODEs is relatively involved mathematically. The -main challenge comes from a mix of different notation and conventions -used in different fields. Here we relate commonly used notation within -Stan math to the [user guide of +The ODE adjoint method for ODEs is relatively involved +mathematically. The main challenge comes from a mix of different +notation and conventions used in different fields. Here we relate +commonly used notation within Stan math to the [user guide of CVODES](https://computing.llnl.gov/sites/default/files/public/cvs_guide-5.6.1.pdf), -the underlying library we employ to handle the numerical -solution. The goal of reverse mode autodiff is to calculate the -derivative of some function $l(\theta)$ wrt. to it's parameters -$\theta$ at some particular realization of $\theta$; here $\theta$ -denotes a set of parameters. The function $l$ -now depends on the solution of an ODE given as initial value problem +the underlying library we employ to handle the numerical solution. The +goal of reverse mode autodiff is to calculate the derivative of some +function $l(\theta)$ wrt. to it's parameters $\theta$ at some +particular realization of $\theta$; here $\theta$ denotes a set of +parameters. For example, the likelihood is defined as the normal log +probability density for a single observation at time-point $T$. The +mean of this normal log probability density is modelled as the +solution of an ODE. The ODE itself depends on some parameters $p$ +which are a subset of $\theta$. The ODE is given as initial value +problem \begin{align} \dot{y} &= f(y,t,p) (\#eq:odeDef) \\ y(t_0) &= y^{(0)}. (\#eq:odeInitial) \end{align} -The ODE has $N$ states, $y=(y_1, \ldots, y_N)$, and is -parameterized in terms of parameters $p$ which are a subset of -$\theta$. Let's now further assume for simplicity that the function -$l$ only depends on the solution of the ODE at the time-point -$T$. During the reverse sweep of reverse mode autodiff we are then -given the autodiff adjoints of $a_{l,y_n}$ wrt to the output state -$y(T,p)$. These -are the partials of $l$ wrt to each state +The ODE has $N$ states, $y=(y_1, \ldots, y_N)$. During the reverse +sweep of reverse mode autodiff we are then given the autodiff adjoints +of $a_{l,y_n}$ wrt to the output state $y(T,p)$. These are the +partials of $l$ wrt to each state $$ a_{l,y_n} = \left.\frac{\partial l}{\partial y_n}\right|_{t=T} $$ @@ -268,12 +268,17 @@ The arguments are: hand side. The types ```T1, T2, ...``` can be any type, but they must match the types of the matching arguments of ```f```. -In addition a function version is made available which mimics the -existing ode interface for tolerances. This version will allow users -to quickly try out the new adjoint interface. **The extra control -parameters will be set based on best guesses as of now and are subject -to change**: - +In addition a simpler function version which mimics the existing ode +interface for tolerances should be made available once more experience +has been gained with the new ODE adjoint method. While at the current +stage it is premature to make such a simpler interface available, the +simpler interface is intended to let users quickly try out the new +adjoint interface. **As it is currently not clear how the extra +control parameters will be set based the function signature will not +be made available in the Stan language in the first version of the +adjoint ODE solver**. Nontheless, once sufficient experience has been +gained with the adjoint ODE method the following signature should be +made available: ```stan vector[] ode_adjoint_tol(F f, @@ -303,13 +308,20 @@ The best guesses based on preliminary experiments for the remaining tuning parameters are then set as - same relative tolerance in all problems -- `abs_tol_f = abs_tol / 100` -- `abs_tol_b = abs_tol / 10` +- `abs_tol_f = abs_tol / 10` +- `abs_tol_b = abs_tol / 3` - `abs_tol_q = abs_tol` - $250$ steps between checkpoints - bdf solver for forward and backward solve - hermite polynomial method for forward function approximation +The above best guesses are only based on early experiments and will +need to be refined. This design doc intentionally does not define when +"sufficient" experience has been collected until the simplified +function can be made available. However, the above defaults (or an +updated set) should be included in the documentation of the adjoint +ODE solver. + # Drawbacks [drawbacks]: #drawbacks From 266f10ec1c90a0fa5b1cb7456f755cbacbb6349d Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Wed, 7 Apr 2021 21:15:32 +0200 Subject: [PATCH 11/14] doc some limitations --- designs/0027-adjoint-ode.md | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 3250663..8030bca 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -230,7 +230,7 @@ The proposed function should have the following signature: vector[] ode_adjoint_tol_ctl(F f, vector y0, real t0, real[] times, - real rel_tol_f, vector abs_tol_f, + real relative_tolerance_forward, vector abs_tol_f, real rel_tol_b, vector abs_tol_b, real rel_tol_q, real abs_tol_q, int max_num_steps, @@ -246,7 +246,7 @@ The arguments are: 3. ```t0``` - Initial time of the ode solve 4. ```times``` - Sorted arary of times to which the ode will be solved (each element must be greater than t0) -5. ```rel_tol_f``` - Relative tolerance for forward solve (data) +5. ```relative_tolerance_forward``` - Relative tolerance for forward solve (data) 6. ```abs_tol_f``` - Absolute tolerance vector for each state for forward solve (data) 7. ```rel_tol_b``` - Relative tolerance for backward solve (data) @@ -332,6 +332,28 @@ ODE. This would allow for more efficient solvers and even larger systems, but this is not possible at the moment to figure out structurally the inter-dependencies of inputs and outputs. +Of note is that the adjoint ODE method is not efficient by +construction to calculate the gradient wrt to multiple outputs as +needed for a Jacobian. Since for each function output the gradient for +all parameters must be calculated by resetting the adjoints back to +zero and then rerunning the backward sweep of AD (chain method) once +again. Hence, this triggers for each function output gradient a rerun +of the backward integration problem, which is not needed for the +forward ODE method. + +The proposed function signature restricts the absolute tolerances for +the quadrature backward problem to be a scalar only as opposed to +allowing for a vector absolute tolerance which assigns to each +parameter qudarature problem it's own absolute tolerance. This +simplifcation is made for usability reasons and the emprical +observation that neither any CVODES example codes nor Julia's +DifferentialEquations.jl package does facilitate a vector absolute +tolerance for the backward quadrature integration. A user may work +around this limitation by scaling the states and the parameters which +will affect the order of the gradients accordingly and therefore +change the tolerance checks of CVODES whenever a gradient approaches +zero. + # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives From 1fe6e663b7966ab8e1382e84455488942616b132 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Wed, 7 Apr 2021 21:16:46 +0200 Subject: [PATCH 12/14] align names with other ODE solvers --- designs/0027-adjoint-ode.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 8030bca..4c8211e 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -230,13 +230,13 @@ The proposed function should have the following signature: vector[] ode_adjoint_tol_ctl(F f, vector y0, real t0, real[] times, - real relative_tolerance_forward, vector abs_tol_f, - real rel_tol_b, vector abs_tol_b, - real rel_tol_q, real abs_tol_q, + real relative_tolerance_forward, vector absolute_tolerance_forward, + real relative_tolerance_backward, vector absolute_tolerance_backward, + real relative_tolerance_quadrature, real absolute_tolerance_quadrature, int max_num_steps, int num_steps_between_checkpoints, int interpolation_polynomial, - int solver_f, int solver_b + int solver_forward, int solver_backward T1 arg1, T2 arg2, ...) ``` @@ -247,12 +247,12 @@ The arguments are: 4. ```times``` - Sorted arary of times to which the ode will be solved (each element must be greater than t0) 5. ```relative_tolerance_forward``` - Relative tolerance for forward solve (data) -6. ```abs_tol_f``` - Absolute tolerance vector for each state for +6. ```absolute_tolerance_forward``` - Absolute tolerance vector for each state for forward solve (data) -7. ```rel_tol_b``` - Relative tolerance for backward solve (data) -8. ```abs_tol_b``` - Absolute tolerance vector for each state backward solve (data) -9. ```rel_tol_q``` - Relative tolerance for backward quadrature (data) -10. ```abs_tol_q``` - Absolute tolerance for backward quadrature (data) +7. ```relative_tolerance_backward``` - Relative tolerance for backward solve (data) +8. ```absolute_tolerance_backward``` - Absolute tolerance vector for each state backward solve (data) +9. ```relative_tolerance_quadrature``` - Relative tolerance for backward quadrature (data) +10. ```absolute_tolerance_quadrature``` - Absolute tolerance for backward quadrature (data) 11. ```max_num_steps``` - Maximum number of time-steps to take in integrating the ODE solution between output time points for forward and backward solve (data) @@ -260,9 +260,9 @@ The arguments are: solution (data) 13. ```interpolation_polynomial``` can be 1 for hermite or 2 for polynomial interpolation method of CVODES -14. ```solver_f``` solver used for forward ODE problem: 1=Adams, +14. ```solver_forward``` solver used for forward ODE problem: 1=Adams, 2=bdf -15. ```solver_b``` solver used for backward ODE problem: 1=Adams, +15. ```solver_backward``` solver used for backward ODE problem: 1=Adams, 2=bdf 16. ```arg1, arg2, ...``` - Arguments passed unmodified to the ODE right hand side. The types ```T1, T2, ...``` can be any type, but they must match @@ -284,7 +284,7 @@ made available: vector[] ode_adjoint_tol(F f, vector y0, real t0, real[] times, - real rel_tol, real abs_tol, + real relative_tolerance, real absolute_tolerance, int max_num_steps, T1 arg1, T2 arg2, ...) ``` @@ -295,8 +295,8 @@ The arguments are: 3. ```t0``` - Initial time of the ode solve 4. ```times``` - Sorted array of times to which the ode will be solved (each element must be greater than t0) -5. ```rel_tol``` - Relative tolerance for all solves (data) -6. ```abs_tol``` - Absolute tolerance for all solves (data) +5. ```relative_tolerance``` - Relative tolerance for all solves (data) +6. ```absolute_tolerance``` - Absolute tolerance for all solves (data) 7. ```max_num_steps``` - Maximum number of time-steps to take in integrating the ODE solution between output time points for forward and backward solve (data) @@ -308,9 +308,9 @@ The best guesses based on preliminary experiments for the remaining tuning parameters are then set as - same relative tolerance in all problems -- `abs_tol_f = abs_tol / 10` -- `abs_tol_b = abs_tol / 3` -- `abs_tol_q = abs_tol` +- `absolute_tolerance_forward = absolute_tolerance / 10` +- `absolute_tolerance_backward = absolute_tolerance / 3` +- `absolute_tolerance_quadrature = absolute_tolerance` - $250$ steps between checkpoints - bdf solver for forward and backward solve - hermite polynomial method for forward function approximation From f67693308c64e3e51a9d3873854537cf6aa13a7b Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Wed, 7 Apr 2021 21:30:26 +0200 Subject: [PATCH 13/14] fix --- designs/0027-adjoint-ode.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 4c8211e..91026b0 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -236,7 +236,7 @@ vector[] ode_adjoint_tol_ctl(F f, int max_num_steps, int num_steps_between_checkpoints, int interpolation_polynomial, - int solver_forward, int solver_backward + int solver_forward, int solver_backward, T1 arg1, T2 arg2, ...) ``` From fb171b982eb9a9533b25c1de0a08770f3bf0ecc1 Mon Sep 17 00:00:00 2001 From: Sebastian Weber Date: Thu, 15 Apr 2021 21:12:38 +0200 Subject: [PATCH 14/14] note that vector absoute tolerances can be retrofitted if deemed needed --- designs/0027-adjoint-ode.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/designs/0027-adjoint-ode.md b/designs/0027-adjoint-ode.md index 91026b0..edcad32 100644 --- a/designs/0027-adjoint-ode.md +++ b/designs/0027-adjoint-ode.md @@ -349,10 +349,15 @@ simplifcation is made for usability reasons and the emprical observation that neither any CVODES example codes nor Julia's DifferentialEquations.jl package does facilitate a vector absolute tolerance for the backward quadrature integration. A user may work -around this limitation by scaling the states and the parameters which -will affect the order of the gradients accordingly and therefore -change the tolerance checks of CVODES whenever a gradient approaches -zero. +around this limitation to some extent by scaling the states and the +parameters which will affect the order of the gradients accordingly +and therefore change the tolerance checks of CVODES whenever a +gradient approaches zero. As the lack of the feature of a vector +absolute tolerance for the parameter quadratures may potentially pose +a limitation for future uses of the adjoint ODE solver, the +implementation in stan-math should be written in an manner allowing +for an inclusion of this feature at a later time (for example through +an overload of the proposed signature). # Rationale and alternatives [rationale-and-alternatives]: #rationale-and-alternatives