--- title: "hmde for Mathematicians" output: rmarkdown::html_vignette bibliography: vignette.bib vignette: > %\VignetteIndexEntry{hmde for Mathematicians} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette is intended for a statistical/mathematical audience who are interested in Bayesian inverse problems. For biologists looking for an applications-based walkthrough the other vignettes in this package -- `hmde`, `constant-growth`, `von-bertalanffy`, and `canham` -- are less theoretical. # The Theory The underlying method that the hmde package implements leverages the longitudinal structure of repeat measurement data to estimate parameters for an underlying differential equation, as first demonstrated in \cite{obrien2024allindividuals}. This method is an example of a Bayesian inverse method for differential equation estimation: we are attempting to estimate the parameters of a chosen DE based on observations of the resulting process. We assume that the data consists of a finite number of discrete (likely sparse) observations with measurement error at the individual level, and have a set number of possible differential equations ready to be fit to data. ## The Maths In a general setting, we are interested in some quantity $Y(t)$ that changes over time (approximately) according to a chosen DE $f$. We have some finite finite number measurements at times $t_j$, and believe that the underlying true behaviour is given by $$ Y(t_{j+1}) = Y(t_j) + \int_{t_j}^{t_{j+1}} f(Y(t), \boldsymbol{\theta})\,dt\qquad (1) $$ for unseen parameter vector $\boldsymbol{\theta}$ that we wish to estimate. We also have an initial condition $Y_0 = Y(t_0)$. We have three in-built growth functions in `hmde`. - Constant: $f = \beta$ chosen as when you only have two observations the average growth rate is the best you can do. Furthermore, the results from the constant model align with a linear mixed effects model for size with individual parameter. The constant model is size-independent. As all numerical methods are equivalent and the same as the analytic solution given the initial condition, we use the Euler method for the constant model. - von Bertalanffy: $$f = \beta (Y_{max} - Y(t))\qquad\qquad(2)$$ which has a history of use in biology for species that grow to some maximum size, and represents a simple size-dependent linear function. If a power law model is desired, log-transforming observations and then back-transforming by exponentiation gives such a function. Equation (2) is implemented with the analytic solution rather than a numerical method, as it is a known nightmare example for numerical stability due to the negative coefficient on $Y(t)$. - Canham: $$ f = f_{max} \exp\Bigg(-\frac{1}{2}\bigg(\frac{\log(Y(t)/Y_{max})}{k} \bigg)^2 \Bigg),\qquad\qquad(3) $$ which is considered a reasonable approximation of long term growth behaviour for some tree species as shown in \cite{obrien2024allindividuals} and [Chapter 2]. Equation (3) is extremely non-linear and does not have an analytic solution, forcing the use of numerical methods in order to estimate the growth increments in Equation (1). - Linear: $$f = \beta_0 - \beta_1 Y(t)$$ which is only included for demonstration purposes of where numerical methods can go wrong as it is a re-parameterisation of the von Bertalanffy model. Choice of appropriate function is an exercise for the user and depends on the available data. Aside from the linear model all have versions that work with both a single individual, and multiple individuals. We provide example data intended for use with each of the primary models: - `Trout_Size_Data` for the constant model, - `Lizard_Size_Data` for the von Bertalanffy model, - `Tree_Size_Data` for the Canham model. ## The Stats We assume that we do not have access to $\boldsymbol{\theta}$ or the true values of $Y$ over time. Instead we have observations with measurement error, $$ y_j = Y(t_j) + \text{error}, $$ and estimate $\boldsymbol{\theta}$ with $\hat{\boldsymbol{\theta}}$. We use a hierarchical structure to encode different levels of relationships within the data. At the bottom of the hierarchy is the measurement level $$ y_j \sim \mathcal{N}(\hat{Y}(t_j), \sigma_e), $$ where we assume normally distributed error. This may not always be true, but \ref{obrien2024allindividuals} indicated that symmetric error centred at 0 may be enough for reasonable results. The longitudinal structure in Equation (1) serves as the next level, connecting estimated sizes over time based on the chosen function $f$ and estimated parameters $\hat{\boldsymbol{\theta}}$, which operates at the level of the individual. If the data has multiple individuals we add additional layers that act as hyper-parameters on the distributions of elements of each individual $i$'s $\hat{\boldsymbol{\theta}}_i$. We build these to be independent $\theta_{ki}$s, with log-mean and log-standard deviation parameters $$ \theta_k \sim \log\mathcal{N}(\mu_k, \sigma_k), $$ and typically use the following priors: $$ \mu_k \sim\mathcal{N}(0,1), \quad 0 < \sigma_k\sim Cauchy(0, 1). $$ The error parameter $\sigma_e >0$ is assumed to operate at a global level independent of individual and typically has a Cauchy prior with location 0, spread parameter 1. Estimation is done using MCMC through Stan. ## Integration of time series Numerical methods are required for the Canham model as it has no analytic solution, and the inbuilt Stan Runge-Kutta 4-5 solver is used. For the von Bertalanffy model an analytic solution is used in order to avoid numerical problems. For the constant model all numerical methods are the same and give the same result as the analytic solution so Euler is used. # Getting started with {hmde} 'hmde' is under active development, you can install the development version of 'hmde' from [GitHub](https://github.com/) with: ```{r, eval=FALSE} # install.packages("remotes") remotes::install_github("traitecoevo/hmde") ``` ```{r} library(dplyr) library(ggplot2) library(hmde) ``` # Demonstration: Canham Growth - Multiple Individuals The provided tree data for 50 individuals takes a few hours to run. As such, the following block does not run by default, and instead we leverage the provided estimates data file: `Tree_Size_Ests`. The estimates are posterior mean, median, and 95\% central credible intervals for parameters, and mean posterior estimates of sizes over time based on the longitudinal model. ```{r, eval=FALSE} # Build fit and extract estimates canham_multi_ind_fit <- hmde_model("canham_multi_ind") |> hmde_assign_data(data = Tree_Size_Data) |> hmde_run(chains = 1, cores = 1, iter = 1000) Tree_Size_Ests <- hmde_extract_estimates(model = "canham_multi_ind", fit = canham_multi_ind_fit, input_measurement_data = Tree_Size_Data) ``` The following code produces plots of the Canham function for each individual between the first and last estimated size. The purpose of the plot is to look at how different individual growth functions behave. ```{r} summary(Tree_Size_Ests) # Plot fitted growth function pieces plot_par_individual_data <- Tree_Size_Ests$individual_data[,c(1, 2, 6, 10)] #Pull out estimates only hmde_plot_de_pieces(model = "canham_multi_ind", individual_data = plot_par_individual_data, measurement_data = Tree_Size_Ests$measurement_data) ``` Each line represents 25 years of growth for the specific individual. Lines that sit lower on the $y$-axis are shorter horizontally because they are traversed more slowly, as $f$ is the rate of change in $Y$. To understand the error model, the following code plots a number of individual sizes over time, then super-imposes those growth curves in black on the individual growth function plots. To view a specific individual, use the `ind_id` value to select them. ```{r} #Plots of size over time for a sample of 5 individuals sample_ids <- sample(1:nrow(Tree_Size_Ests$individual_data), size=5) %>% sort() plot_data <- Tree_Size_Ests$measurement_data %>% filter(ind_id %in% sample_ids) ind_size_lims <- Tree_Size_Ests$measurement_data %>% filter(ind_id %in% sample_ids)%>% group_by(ind_id) %>% summarise(y_0 = min(y_hat), y_final = max(y_hat)) ggplot(data=plot_data, aes(group = ind_id)) + geom_point(aes(x = time, y=y_obs, colour = as.factor(ind_id)), shape = 1) + geom_line(aes(x = time, y=y_obs, colour = as.factor(ind_id)), linetype = "dashed") + geom_point(aes(x = time, y=y_hat, colour = as.factor(ind_id)), shape = 2) + geom_line(aes(x = time, y=y_hat, colour = as.factor(ind_id)), linetype = "solid") + labs(x="Time (years)", y="DBH (cm)", colour="Ind. ID") + theme_classic() #Load DE for Canham DE_function = hmde_model_des("canham_multi_ind") #Produce plot with focus inds function_plot <- hmde_plot_de_pieces(model = "canham_multi_ind", individual_data = plot_par_individual_data, measurement_data = Tree_Size_Ests$measurement_data, alpha = 0.2) for(i in 1:length(sample_ids)){ args_list <- list(pars=Tree_Size_Ests$individual_data[sample_ids[i],c(2, 6, 10)]) function_plot <- function_plot + geom_function(fun=DE_function, args=args_list, colour="black", linewidth=1, alpha = 1, xlim=c(ind_size_lims$y_0[i], ind_size_lims$y_final[i])) } function_plot ```