--- title: "Here be dragons" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Here be dragons} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console bibliography: vignette.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE ) # job::job({ # knitr::knit("vignettes/here-be-dragons.Rmd.orig", "vignettes/here-be-dragons.Rmd") # }) options(rmarkdown.html_vignette.check_title = FALSE) ``` ## Load dependencies ```{r} #remotes::install_github("traitecoevo/hmde@a93e02e48657bb648378f584c4a2166d38fea263", force = TRUE) library(hmde) library(dplyr) library(ggplot2) library(deSolve) library(cowplot) ``` This vignette demonstrates an interaction between errors from numerical integration methods and MCMC sampling. The underlying issue is that if, given errors in the chosen numerical integration method, two sets of parameters for a differential equation $f$ give "the same" (more in a second) output for $$Y(t_{j+1}) = Y(t_j) + \int_{t_j}^{t_{j+1}} f(Y(t), \boldsymbol{\theta})\,dt\qquad (1)$$ the MCMC sampler will be unable to meaningfully distinguish the parameter combinations. Thus, there is a form of non-identifiability that arises from numerical errors in a longitudinal model based on Equation (1) that are separate to other issues of non-identifiability, and currently under-explored in the literature. In this demonstration we use a Runge-Kutta 4th order numerical method (\cite{butcher2016numerical}) with different step sizes to show that even 'small' step sizes can give problems. It is important to state that "the same" $Y(t)$ values in this context means within statistical error of each other. We assume that our data consists of observations of the form $y_{j}$ at time $t_j$ that look like $$y_j = Y(t_j) + \text{error},$$ and have some finite level of precision. The numerical method may produce estimated values $\hat{Y}(t_j)$ that differ by some amount that is much smaller than the level of precision or observation error for the different parameter combinations, but due to the imprecision of the measurement process the MCMC sampler cannot meaningfully distinguish the estimates. For this demonstration we will use rounding to produce simulated data with measurement precision of 0.1, and error of $\mathcal{N}(0, 0.1)$, analogous to the 1mm measurement precision and approximate standard deviation of the real-world source data used in @obrien2024allindividuals. ## The model We are implementing a longitudinal model of the form in Equation (1) within a hierarchical Bayesian longitudinal model where $$f(Y(t), \beta_0, \beta_1) = \beta_0 - \beta_1 Y(t)\qquad (2)$$ Equation (2) is known to produce pathological behaviour from numerical methods [@butcher2016numerical], so serves as an ideal simple example of the interaction between those pathologies and the MCMC sampling process. We are attempting to estimate the parameters $\beta_0$ and $\beta_1$ from observations $y_j$, which is based on estimating $\hat{Y}_j$ given the prior distribution $$y_j \sim \mathcal{N}(\hat{Y}_j, 0.1).$$ As we are looking at a single individual, we have prior distributions for the parameters which are $$\beta_k \sim \log\mathcal{N}(0,2),$$ and enforce $\beta_k > 0$. ## Simulating data For this demo code we use $\beta_0 = 10$ and $\beta_1 = 1$. If you wish to experiment with other values, input them in the next block and the rest will run based on that. 
We use the analytic solution to simulate true sizes over time, then add measurement error and round to the chosen measurement precision of 0.1 to give a sequence of observations over time that become the `y_obs` data for the model fit. Notice that `y_obs` can produce values that are bigger than the theoretical maximum value $\beta_0/\beta_1$ due to error.

```{r}
#Change these values to change the model parameters. Must be positive values.
beta_0 <- 10
beta_1 <- 1

#True initial condition
true_y_0 <- 1
max_time <- 9
time <- 0:max_time

#Analytic solution
y_true <- (beta_0/beta_1) + (true_y_0 - (beta_0/beta_1)) * exp(-beta_1 * time)

#Produce observations with error and limited precision
y_obs <- round(y_true + rnorm(length(y_true), 0, 0.1), digits = 1)

#Unrounded data if needed
#y_obs <- y_true + rnorm(length(y_true), 0, 0.1)

#Observed data frame
obs_data_frame <- tibble(
  time = time,
  y_obs = y_obs,
  obs_index = 1:length(y_obs)
)

#Have a look at the true and 'observed' data
plot_data <- tibble(
  x = c(time, time),
  y = c(y_true, y_obs),
  source = c(rep("True sizes", times = length(y_true)),
             rep("Obs. sizes", times = length(y_obs)))
)

ggplot(plot_data, aes(x = x, y = y, group = source)) +
  geom_point(aes(colour = source, shape = source)) +
  geom_line(aes(colour = source, linetype = source)) +
  theme_classic() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.7, 0.3))
```

A note on precision: the same bad behaviour occurs even if you use unrounded data with all the precision R offers. You can test this by uncommenting the line that just adds measurement error. We decided to round the observations in order to give a more realistic demonstration of the issue.

## Implementing the model

We're going to run 100 fits with a step size of 0.5, using the custom RK4 solver and a single chain each. We do single chains because each chain can fall into the numerical error trap on its own, and the easiest way to identify that trap is to extract the estimates afterwards. Each fit takes a few seconds to run, so allow several minutes for the following block. There are likely to be diagnostic problems, but that is part of what we are here to explore, so we will be ignoring them. Divergent transitions in particular are to be expected given the very different parameter estimates we see. Results are hidden for this block.

```{r, results='hide'}
runs <- 100
step_size <- 0.5
par_est_tibble <- tibble(run = c(),
                         step_size = c(),
                         beta_0 = c(),
                         beta_1 = c())

for(i in 1:runs){
  #Run the model
  suppressWarnings(
    fit <- hmde_model("linear_single_ind") |>
      hmde_assign_data(n_obs = nrow(obs_data_frame),
                       y_obs = obs_data_frame$y_obs,
                       obs_index = obs_data_frame$obs_index,
                       time = obs_data_frame$time,
                       y_bar = mean(obs_data_frame$y_obs),
                       step_size = step_size) |>
      hmde_run(chains = 1, cores = 1, iter = 2000)
  )

  #Extract parameter estimates
  ests <- hmde_extract_estimates(model = "linear_single_ind",
                                 fit = fit,
                                 input_measurement_data = obs_data_frame)

  temp <- tibble(
    run = i,
    step_size = step_size,
    beta_0 = ests$individual_data$ind_beta_0_mean,
    beta_1 = ests$individual_data$ind_beta_1_mean
  )

  par_est_tibble <- rbind(par_est_tibble, temp)
}
```
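Before adding more step sizes, a quick way to see whether any of the runs fell into the second mode is to tally the estimates against a rough cut-off. This check is a sketch rather than part of the fitting workflow; the `beta_0 > 20` threshold is a heuristic chosen to sit between the two modes, matching the filter used in the analysis below.

```{r, eval = FALSE}
#Tally runs on either side of a heuristic cut-off between the two modes
par_est_tibble %>%
  mutate(mode = ifelse(beta_0 > 20, "Erroneous mode", "Near true values")) %>%
  count(mode)
```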
For the purpose of further exploration, here are some additional step sizes whose results will concatenate to the existing parameter estimate tibble. This block does not run by default because the runtime increases as the step size decreases, and with 100 runs per step size this can take a while.

```{r, eval = FALSE}
step_sizes <- c(0.25, 0.125)

for(j in 1:length(step_sizes)){
  print(paste0("Fits for step size ", step_sizes[j]))
  for(i in 1:runs){
    print(paste0("Run: ", i, " step size: ", step_sizes[j]))
    #Run the model
    fit <- hmde_model("linear_single_ind") |>
      hmde_assign_data(n_obs = nrow(obs_data_frame),
                       y_obs = obs_data_frame$y_obs,
                       obs_index = obs_data_frame$obs_index,
                       time = obs_data_frame$time,
                       y_bar = mean(obs_data_frame$y_obs),
                       step_size = step_sizes[j]) |>
      hmde_run(chains = 1, cores = 1, iter = 2000)

    #Extract parameter estimates
    ests <- hmde_extract_estimates(model = "linear_single_ind",
                                   fit = fit,
                                   input_measurement_data = obs_data_frame)

    temp <- tibble(
      run = i,
      step_size = step_sizes[j],
      beta_0 = ests$individual_data$ind_beta_0_mean,
      beta_1 = ests$individual_data$ind_beta_1_mean
    )

    par_est_tibble <- rbind(par_est_tibble, temp)
  }
}
```

## Analysis

Let's extract some summary statistics and look at the distribution of parameter estimates. First we'll get histograms of the 1-dimensional distributions, then we'll look at a scatter plot.

```{r}
error_ests <- tibble(error_beta_0 = c(),
                     error_beta_1 = c(),
                     step_size = c(),
                     count = c(),
                     fraction = c(),
                     dist = c())
beta_0_plot_list <- list()
beta_1_plot_list <- list()
scatterplot_errors_only <- list()
colours <- c("#609cff", "#00ba38", "#f8766d")
shapes <- c(15, 17, 19)

for(i in 1:length(unique(par_est_tibble$step_size))){
  step_size_select <- unique(par_est_tibble$step_size)[i]

  plot_data <- par_est_tibble %>%
    filter(step_size == step_size_select)

  error_estimates_temp <- plot_data %>%
    filter(beta_0 > 20) %>%
    summarise(error_beta_0 = median(beta_0),
              error_beta_1 = median(beta_1),
              step_size = mean(step_size),
              count = n(),
              fraction = n()/nrow(plot_data))

  #Distance between the erroneous estimates and the true parameter values
  dist_table <- tibble(
    b_0 = c(error_estimates_temp$error_beta_0, beta_0),
    b_1 = c(error_estimates_temp$error_beta_1, beta_1)
  )
  error_estimates_temp$dist <- dist(dist_table)

  error_ests <- rbind(error_ests, error_estimates_temp)

  beta_0_plot_list[[i]] <- ggplot(plot_data, aes(beta_0)) +
    geom_histogram(bins = 25, colour = "black", fill = "lightblue") +
    geom_vline(xintercept = beta_0, linetype = "dashed",
               linewidth = 1, colour = "black") +
    geom_vline(xintercept = error_ests$error_beta_0[i], linetype = "dotted",
               linewidth = 1, colour = "cornflowerblue") +
    xlim(0, max(par_est_tibble$beta_0)) +
    labs(x = "beta_0 est.",
         title = paste0("Step size: ", step_size_select)) +
    theme_classic()

  beta_1_plot_list[[i]] <- ggplot(plot_data, aes(beta_1)) +
    geom_histogram(bins = 25, colour = "black", fill = "lightgreen") +
    geom_vline(xintercept = beta_1, linetype = "dashed", linewidth = 1) +
    geom_vline(xintercept = error_ests$error_beta_1[i], linetype = "dotted",
               linewidth = 1, colour = "forestgreen") +
    xlim(-1, max(par_est_tibble$beta_1)) +
    labs(x = "beta_1 est.",
         title = paste0("Step size: ", step_size_select)) +
    theme_classic()

  error_ests_scatter <- plot_data %>%
    filter(beta_0 > 20)
  xpos <- (min(error_ests_scatter$beta_0) +
             0.2*(max(error_ests_scatter$beta_0) - min(error_ests_scatter$beta_0)))
  ypos <- (max(error_ests_scatter$beta_1) -
             0.1*(max(error_ests_scatter$beta_1) - min(error_ests_scatter$beta_1)))

  scatterplot_errors_only[[i]] <- ggplot(data = error_ests_scatter,
                                         aes(x = beta_0, y = beta_1)) +
    geom_point(colour = colours[i], shape = shapes[i],
               alpha = 0.5, size = 2) +
    labs(x = "beta_0 est.", y = "beta_1 est.",
         title = paste0("Step size: ", step_size_select)) +
    annotate("text", x = xpos, y = ypos,
             label = paste0("Error rate: \n", error_estimates_temp$fraction)) +
    theme_classic()
}

error_ests
```
We get clear bimodality in both the histograms and the scatter plot. The bias in the estimates closest to the true values is due to the same measurement error appearing in the 'observed' data for all the fits. The second mode in the estimates arises from the numerical integration error.

We can double-check the numerical error by using an independent solver with the same step size. We use the `deSolve` package, which has an implementation of RK4 and allows us to choose the step sizes through the `times` argument.

```{r}
#install.packages("deSolve")
library(deSolve)

#Create DE function
DE <- function(Time, State, Pars) { #Implementation of the DE
  with(as.list(c(State, Pars)), {
    dY <- beta_0 - beta_1 * Y
    return(list(c(dY)))
  })
}

yini <- c(Y = true_y_0) #Initial condition

y_over_time <- tibble(model = "True Sizes",
                      y_hat = y_true,
                      time = 0:max_time)

#Generate y(t) with RK4 given the parameter estimates
for(i in 1:nrow(error_ests)){
  pars_combo <- c(beta_0 = error_ests$error_beta_0[i],
                  beta_1 = error_ests$error_beta_1[i])
  times <- seq(0, max_time, by = error_ests$step_size[i])

  numerical_output <- ode(yini, times, DE, pars_combo, method = "rk4")

  y_over_time_temp <- tibble(
    model = error_ests$step_size[i],
    y_hat = numerical_output[,2],
    time = times
  )

  y_over_time <- rbind(y_over_time, y_over_time_temp)
}
```

For plotting purposes, we're going to filter out the numerical estimates that occur at sub-steps of the observation times.

```{r}
y_over_time_filtered <- y_over_time %>%
  filter(time %in% 0:max_time)

#Plot sizes over time
compare_sizes_over_time <- ggplot(y_over_time_filtered,
                                  aes(x = time, y = y_hat,
                                      group = as.factor(model))) +
  geom_point(aes(colour = as.factor(model), shape = as.factor(model)),
             alpha = 0.5, size = 2, stroke = 1.5) +
  geom_line(aes(colour = as.factor(model)),
            alpha = 0.5, linewidth = 1) +
  labs(x = "Time", y = "Y(t)", title = "Fitted step size",
       colour = "Step size", shape = "Step size") +
  theme_classic() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.7, 0.3))
compare_sizes_over_time
```

What we're interested in here is that despite the wildly different parameter estimates, there's strong alignment between the sizes over time due to the numerical error.
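To see why a second mode can exist at all, consider the classic RK4 update applied to Equation (2). Because $f$ is affine in $Y$, each RK4 step multiplies the distance from the asymptote by the stability polynomial $R(z) = 1 + z + z^2/2 + z^3/6 + z^4/24$ evaluated at $z = -\beta_1 h$, where $h$ is the step size. $R$ is not monotone for negative $z$, so a second value of $\beta_1$ can produce exactly the same per-step factor; if $\beta_0/\beta_1$ is also preserved, the two discrete trajectories coincide exactly. The following sketch is not part of the fitting workflow, and the `uniroot` search interval is a heuristic based on the shape of $R$; it finds such a partner for the true parameters at our step size.

```{r, eval = FALSE}
#RK4 applied to dY/dt = beta_0 - beta_1 * Y gives
#  Y_{n+1} - beta_0/beta_1 = R(-beta_1 * h) * (Y_n - beta_0/beta_1)
R <- function(z) 1 + z + z^2/2 + z^3/6 + z^4/24

h <- 0.5                 #Step size used in the fits above
target <- R(-beta_1 * h) #Per-step factor at the true beta_1

#R(z) dips to a minimum near z = -1.6 then rises again, so a second z
#with the same R value sits further left; the bounds here are heuristic.
z_alt <- uniroot(function(z) R(z) - target, interval = c(-4, -2))$root

beta_1_alt <- -z_alt / h
beta_0_alt <- (beta_0/beta_1) * beta_1_alt #Preserve the asymptote

c(beta_0_alt = beta_0_alt, beta_1_alt = beta_1_alt)
```

With the demo parameters this lands near $\beta_0 \approx 49$, $\beta_1 \approx 4.9$. The fitted second mode will not match these values exactly, since the sampler also contends with measurement error, but it should sit in the same vicinity.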
Here's a plot that uses a much smaller step size for all of the parameter combinations to show the error.

```{r}
#Generate y(t) with RK4 given the parameter estimates
y_over_time_smallstep <- tibble(model = "True Sizes",
                                y_hat = y_true,
                                time = 0:max_time)

for(i in 1:nrow(error_ests)){
  pars_combo <- c(beta_0 = error_ests$error_beta_0[i],
                  beta_1 = error_ests$error_beta_1[i])
  times <- seq(0, max_time, by = 0.001)

  numerical_output <- ode(yini, times, DE, pars_combo, method = "rk4")

  y_over_time_temp <- tibble(
    model = error_ests$step_size[i],
    y_hat = numerical_output[,2],
    time = times
  )

  y_over_time_smallstep <- rbind(y_over_time_smallstep, y_over_time_temp)
}

point_data <- y_over_time_smallstep %>%
  filter(time %in% 0:max_time)

#Plot sizes over time
compare_sizes_over_time_smallstep <- ggplot(y_over_time_smallstep,
                                            aes(x = time, y = y_hat,
                                                group = as.factor(model))) +
  geom_line(aes(colour = as.factor(model)),
            alpha = 0.5, linewidth = 1) +
  geom_point(data = point_data,
             aes(colour = as.factor(model), shape = as.factor(model)),
             alpha = 0.5, size = 2, stroke = 1.5) +
  labs(x = "Time", y = "Y(t)", title = "Small step size",
       colour = "Step size", shape = "Step size") +
  theme_classic() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.7, 0.3))
compare_sizes_over_time_smallstep

error_ests$Y_max <- error_ests$error_beta_0/error_ests$error_beta_1
plot_data <- tibble(
  x = c(0, (beta_0/beta_1),
        rep(0, times = nrow(error_ests)),
        error_ests$Y_max),
  y = c(beta_0, 0,
        error_ests$error_beta_0,
        rep(0, times = nrow(error_ests))),
  step_size = c("True", "True",
                error_ests$step_size,
                error_ests$step_size)
)

#Plot DEs
error_de_plot <- ggplot(data = plot_data, aes(x, y)) +
  geom_line(aes(colour = as.factor(step_size),
                linetype = as.factor(step_size)),
            linewidth = 1) +
  scale_linetype_manual(values = c("longdash", "dashed", "dotted", "solid")) +
  labs(x = "Y(t)", y = "f", colour = "Step size", linetype = "Step size") +
  theme_classic() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.7, 0.7))
```

The limiting behaviour at $\beta_0/\beta_1$ is consistent; what changes is how fast the curve approaches the asymptote.

If you have fit multiple step sizes, this block will produce a scatter plot of all the parameter estimates together. We can see a very strong linear trend with slope $1/10$, which corresponds to the inverse of the limiting size, and the gap between the true parameters and the erroneous ones gets larger as the step size shrinks. It is also important to note that the frequency of bad estimates decreases along with the step size, which we believe comes from the greater distance: the further the bad estimates are from the true ones, the less likely the MCMC algorithm is to fall into them, particularly as the priors are close to the true values.

```{r, eval = FALSE}
est_scatter <- ggplot(data = par_est_tibble,
                      aes(x = beta_0, y = beta_1)) +
  geom_point(aes(colour = as.factor(step_size),
                 shape = as.factor(step_size)),
             alpha = 0.5, size = 2) +
  labs(colour = "Step size", shape = "Step size") +
  theme_classic() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.2, 0.8))
```
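For a quick numerical check of that slope claim, you can regress the erroneous $\beta_1$ estimates on the $\beta_0$ estimates. This is a sketch using the same heuristic `beta_0 > 20` filter as the analysis above; the fitted slope should sit close to $\beta_1/\beta_0 = 0.1$.

```{r, eval = FALSE}
#Slope of the erroneous-estimate trend; expect roughly beta_1/beta_0 = 0.1
lm(beta_1 ~ beta_0, data = filter(par_est_tibble, beta_0 > 20))
```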
The following block uses the `plot_grid` function from `cowplot` to produce arranged figures, and is set up for use with 3 step sizes.

```{r, eval=FALSE}
#Histograms
plot_grid(beta_0_plot_list[[1]], beta_0_plot_list[[2]], beta_0_plot_list[[3]],
          beta_1_plot_list[[1]], beta_1_plot_list[[2]], beta_1_plot_list[[3]],
          nrow = 3,
          byrow = FALSE,
          align = "v")

#DE, all parameters, and sizes over time
plot_grid(
  error_de_plot,
  est_scatter,
  compare_sizes_over_time,
  compare_sizes_over_time_smallstep,
  nrow = 2
)

#Erroneous parameter distributions
plot_grid(
  scatterplot_errors_only[[1]],
  scatterplot_errors_only[[2]],
  scatterplot_errors_only[[3]],
  nrow = 1
)
```

## Where to from here?

For the purpose of the hmde package that this vignette is a part of, we account for the pathologies in this particular model by using the analytic solution to the von Bertalanffy equation. More work needs to be done to understand the interaction between numerical methods and MCMC sampling. What we have demonstrated is that the problem exists, and that it is not enough to have numerical stability at the true parameter values: because MCMC estimation moves around the parameter space, you need numerical stability in a potentially quite large part of it. The good news is that simulated data and posterior plots produced with more accurate numerical methods can at least identify that something is going wrong.