Code
::p_load(
pacman
here, MCMCglmm, metafor,
phytools, brms, lme4,
lmerTest, nlme, glmmTMB,
phylolm, ape, MASS,
tidyverse, cmdstanr,
posterior, bayesplot, flextable
)
options(digits = 3, scipen = 5)
::p_load(
pacman
here, MCMCglmm, metafor,
phytools, brms, lme4,
lmerTest, nlme, glmmTMB,
phylolm, ape, MASS,
tidyverse, cmdstanr,
posterior, bayesplot, flextable
)
options(digits = 3, scipen = 5)
Our case study revolves around a common question in comparative analysis - what drive brain size evolution? We simplify this question to a more tractable one - the allometric relationship between brain size and body mass both between and within species. We use data coming from multiple species to explore the way non-phylogenetic between-species variation affect the studied relationship. So the overarching relationship we’re interested in is What is the slope of brain-body size allometric relationship? Traditionally, this question was answered using the following strategy:
The modifications we are postulating - and gradually developing below - are as follows:
The code chunk below contains all the parameters set for subsequent simulations of the toy dataset. You can ignore this (parameters are explained later in relevant places). If you want to modify the simulation you can modify the parameters here (globally) or insert specific assignments in relevant chunks.
Since we are simulating a body-size dependent trait - we chose a brain vs. body size relationship in a small group of birds (e.g. corvids) – we should use the log transform each trait to capture size allometry. However, for simplicity, we will consider a small range of body masses within one order of magnitude, which makes the relationship nearly linear. The allometric relationship (assuming an allometric slope of corvids of 0.66) is \(m_{brain} = A\cdot m_{body}^{0.66}\). Our average corvid weighs 1000 g and has a brain weighing 15 g. Assuming a linear rather than power relationship means \(A\) can be approximated as \(A=15/(1000^{0.66})\cong0.15\). From there we can approximate (roughly - as we always do in simulations) the slope of linear brain ~ weight
relationship: \(slope = (A\cdot1200^{0.66}-A\cdot800^{0.66})/(1200-800)\cong0.01\) (in a reasonable and approximately linear range of 800-1200 g).
As for the mass allometry slope: to simulate it only across species, we use the simplest linear model \(y_{ij} = \beta_{0} + \beta_{all} \cdot \bar{x}_{weight,i} \: + \: ...\); to differentiate within- and between-species trends we can use within-subject centered version of the model: \(y_{ij} = \beta_{0} + (\beta_{b} - \beta_{w}) \cdot \bar{x}_{weight,i} \: + \: \beta_{w} \cdot x_{weight,ij} + \: ...\)
You can review all parameters in Tab. 1 .
Notational convention Throughout our paper and simulations below, we reserve \(t\) to denote individual trait values, to differentiate them from effecti sizes derived from literature in pure meta-analysis (\(y\)).
In case of multi-level models (e.g., community-level analysis or function-valued approach) the consistency of notation may be less clear. Parameters of functional curves (e.g., \(f(x|\alpha, \beta)\)) in FVT approaches are used as individual “traits” (\(t^{(1)}\) and \(t^{(2)}\) in the main text). For comunity level analysis, the “trait” (considered at the species level) is denoted as \(t\), and it serves as a response for estimation of the slope parameter of interest (i.e., \(b_1\), defined in \(f(x|b_0, b_1)\), estimated for each of \(m\) communities arranged in a spatial grid). Those community-level metrics are then analyses as responses in a higher level model, estimating their dependence on inter-community predictors (\(\mathbf{X}_c\); see main text for details), and controlling for spatial autocorrelations (\(\mathbf{\epsilon}_c\)). See Appendix 2 for more details about a two-level community-based analysis.
<- 100 # number of species
S <- 10 # number of effect size (study-specific) estimates in species i. Each study reports data on both sexes (males and females).
si # tree simulation
<- 0.8
birth <- 0.4
death
# gamma distribution of sampling variances
<- 2
alpha <- 20
beta
# fixed effects
<- 15 # intercept of response (= average brain size since weight is mean-centered)
beta0 <- 0.005 # slope of weight dependency (see rationale above)
beta1_all
<- -0.0023 # weaker within-species mass allometry
beta1_w <- beta1_all - beta1_w #excess of between-species slope over within-species slope
beta1_b # these two slopes can be used if assuming different within and between species slopes
# but no random slope variation due to phylogeny
<- 1.5 # by how much are males bigger in grams
beta2
# variance parameters
<- 1^2
sigma2_s <- 1.2^2
sigma2_e <- 0.65 # phylogenetic heritability
h2 <- (h2 * (sigma2_s + sigma2_e)) / (1 - h2)
sigma2_p <- sqrt(sigma2_s + sigma2_p + sigma2_e) # total SD ~ 2 (sensible range of brain masses 11-19 g)
SD_total
# body weight average assumed to be zero (zero-mean centered)
<- 0 # in reality 1000 g
mu_weight <- 50^2 # within-species variance in body weights in grams (SD=50g)
sigma2_ws <- 250^2 # between-species SD in body masses (defines the range of body weight values between species) sigma2_bs
Parameter | Value | Comment | Units |
---|---|---|---|
alpha (\(\alpha\)) |
2 | Alpha parameter for gamma distribution of sampling errors | n/a |
beta (\(\beta\)) |
20 | Beta parameter for gamma distribution of sampling errors | n/a |
birth |
0.8 | Speciation parameter for r[hylo() |
n/a |
death |
0.4 | Extinction parameter for rphylo() |
n/a |
beta0 (\(b_{0}\)) |
15 | Intercept of model (average brain size) | g |
beta1_all (\(b_{1}\)) |
0.005 | Allometric slope of brain ~ weight relationship |
unitless (g/g) |
beta1_b (\(b_{1b}\)) |
0.007 | Between-species component of allometric slope | unitless (g/g) |
beta1_w (\(b_{1w}\)) |
-0.002 | Within-species component of allometric slope | unitless (g/g) |
beta2 (\(b_{2}\)) |
1.5 | Sex weight difference (males - females) | g |
sigma2_s (\(\sigma_{s}^2\)) |
1 | Non-phylogenetic between-species variance | g2 |
sigma2_e (\(\sigma_{e}^2\)) |
1.44 | Residual (error) variance | g2 |
sigma2_p (\(\sigma_{p}^2\)) |
4.531 | Phylogenetic between-species variance | g2 |
h2 (\(h^2\)) |
0.65 | Phylogenetic heritability (proportion of variance explained by phylogeny) | unitless |
mu_weight (\(\mu_{w}\)) |
0 | Mean body weight (by definition = 0 - body weight is mean centered) | g |
sigma2_ws (\(\sigma_{ws}^2\)) |
2500 | Within-species variance in body weight (amounts to an SD = 50 g)) | g2 |
sigma2_bs (\(\sigma_{bs}^2\)) |
62500 | Between-species variance in body weight averages (amounts to an SD = 250 g)) | g2 |
We will use a simple linear model to simulate data that have several properties:
We will use the following linear model:
\[ \mathbf{t} = \mathbf{X}\mathbf{b} \:+\: \mathbf{Z_{s}}\mathbf{s} \:+\: \mathbf{Z_{p}}\mathbf{p} \:+\: \mathbf{m} \:+\: \mathbf{e}, \]
Distributional assumptions: \(\begin{aligned} \\ &\mathbf{s} \sim \mathcal{N}(\mathbf{0}, \sigma^2_{s}\mathbf{I}) \\ &\mathbf{p} \sim \mathcal{N}(\mathbf{0}, \sigma^2_{p}\mathbf{A}) \\ &\mathbf{m} \sim \mathcal{N}(\mathbf{0}, \sigma^2_{m} \mathbf{M} ); \: \sigma^2_{m} := 1 \\ &\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \sigma^2_{e}\mathbf{I}) \end{aligned}\)
or in an observation-level form (as in the main manuscript text):
\[ \begin{aligned} t_{jk} & = \mathbf{x_{jk}}^{t}\mathbf{b} \:+\: s_{j} \:+\: p_{j} \:+\: m_{jk} + e_{jk} = \\ & =b_{0} \:+\: b_{1}x_{weight} \:+\: b_{2}x_{sex} \:+\: s_{j} \:+\: p_{j} \:+\: m_{jk} + e_{jk}. \end{aligned} \]
We assume \(jk \in 1, ..., n\) observations (n = 10 $ $ 100) and \(j \in 1, ..., S\) species (S = 100). In the model, the response variable \(\mathbf{t}_{n \times 1}\) is modelled as a linear combination of several components. \(\mathbf{X}_{n \times 3}\mathbf{b}_{3 \times 1}\) defines the influence of two fixed effects (\(\mathbf{b}\) also contains the intercept of the model (\(b_0\)), which in phylogenetically-structured data is the estimate of the ancestral root value \(\beta_0\)).
Both random terms \(\mathbf{Z}_{n \times 100}\mathbf{u}_{100 \times 1}\) (\(\mathbf{u}\) standing for \(\mathbf{s}\) and \(\mathbf{p}\)) define the assignment of each data point to each of the species in our set (the only difference is that the non-phylogenetic species effect is sampled from a distribution assuming \(\sigma_{s}^2\) variance and no dependence between species, whereas phylogenetic effect is sampled assuming phylogenetic variance of \(\sigma_{p}^2\) and a structured covariance matrix (based on \(\mathbf{C}\), derived from the phylogenetic variance-covariance matrix)).
Note, that the quantity \(\sigma^2_{p}/(\sigma^2_{s}+\sigma^2_{p}+\sigma^2_{e})\) is commonly known as phylogenetic signal, or phylogenetic heritability.
In each (species) cluster we will assume to have information coming from 5 males and 5 females, so the i-th species section of the \(\mathbf{X}\) will be (assuming regression coefficients to be defined as \((b_{0}, \: b_{1b}, \: b_{1w}, \: b_{sex})'\)):
For now we’re ignoring the study-level dependency, so no “study ID” random term - but it could be easily added to the model.
\[ \mathbf{X_{i}} = \begin{pmatrix} 1 & x_{weight,i} & x_{weight,i1} & 0 \\ 1 & x_{weight,i} & x_{weight,i2} & 0 \\ 1 & x_{weight,i} & x_{weight,i3} & 0 \\ 1 & x_{weight,i} & x_{weight,i4} & 0 \\ 1 & x_{weight,i} & x_{weight,i5} & 0 \\ 1 & x_{weight,i} & x_{weight,i1} & 1 \\ 1 & x_{weight,i} & x_{weight,i2} & 1 \\ 1 & x_{weight,i} & x_{weight,i3} & 1 \\ 1 & x_{weight,i} & x_{weight,i4} & 1 \\ 1 & x_{weight,i} & x_{weight,i5} & 1 \\ \end{pmatrix}. \]
This \(\mathbf{X}\) matrix assumes that we have average (per-species) body weights, and also within-species variation in individual weight measurements.
Similarly, each \(\mathbf{Z_{.}}\):
\[ \mathbf{Z_{.}} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}. \]
The distribution of \(m_{ij}\) is usually assumed to be known (hence \(\sigma^2_{m}\) is fixed and not estimated); the diagonal elements of \(\mathbf{M}\) contain sampling variances of each individual study (derived using known properties of the effect size chosen, or assumed to be equal to the square of standard error (SE) of the published effect size). In our simulations we will assume that these are, again, known (we will sample relevant sampling variances only once) and we will sample respective \(m_{ij}\) only once at the beginning of simulation.
Note that in meta-analysis we often refer to weights \(w_{ij}\) that are used to give different importance to studies with varying sampling variance. Formally weights (inverses of sampling variances) form the diagonal of \(\mathbf{W} = \mathbf{M}^{-1}\)} - a matrix involved in the GLS estimator equivalent to a weighted regression (that assumes independent but heteroscedastic residual variances).
Let’s generate a hypothetical set of 100 species and simulate a random phylogeny that will be used in subsequent analyses. First we will produce 100 species labels:
<- as.vector(
species_names outer(LETTERS[1:10], LETTERS[11:20], FUN = paste0)
)
Now, simulate the tree using a simple birth-death process and a splitting bifurcating tree with a set number of species (n = 100
). The tree will be of class phylo
. Afterwards, we transform it into a phylogenetic variance-covariance matrix. The tree is generated assuming the birth rate of 0.8 and death rate (extinction) of 0.4 (Fig 1).
<- rphylo(
mytree n = 100, birth = birth, death = death, T0 = 100,
fossils = FALSE
)$tip.label <- species_names
mytreeplot(mytree, cex = 0.5)
<- vcv.phylo(mytree, corr = TRUE) C
The \(\mathbf{C}\) matrix is scaled to a total tree length of 1, which makes it a correlation matrix.
The sampling errors will be drawn from a gamma distribution. We have 600 observations (100 species, 6 effect-sizes each), we will assume \(\text{Gamma}(\alpha,\beta)\) with \(\alpha\) (shape
) equal to 2 and \(\beta\) (rate
) equal to 20.
# set.seed(12345) # to fix it in all instances to the same value
<- rgamma(100, shape = alpha, rate = beta)
v <- diag(v, nrow = length(v), ncol = length(v)) # the M matrix of sampling variances
M
# deviations resulting from the sampling process
<- mvrnorm(n = 1, mu = rep(0, length(v)), Sigma = M) m
To construct the model we have to construct relevant design matrices for fixed and random effects. Fixed effects are easy - we need a vector of 1’s (intercept), bound with a vector of species-specific body weights and a vector of 0’s/1’s assigning the sex of each observation. Let’s assume observations are ordered and grouped according to the species. To simplify, we assume average species body weights are sampled from a normal distribution with mean 0 (mean-centred data) and SD sqrt(sigma2_bs)
. Last column of X
contains deviations of individual within-species body weight measurements.
<- rnorm(S, mean = mu_weight, sd = sqrt(sigma2_bs))
species_weights <- cbind(
X rep(1, S * si),
rep(species_weights, each = si),
rep(c(rep(0, si / 2), rep(1, si / 2)), times = S)
)<- cbind(X, rep(species_weights, each = si) +
X2 rnorm(S * si, mean = 0, sd = sqrt(sigma2_ws)))
1:14, ] # the first 14 rows X[
A section from such X
looks like this:
1:14, ] # the first 14 rows X2[
[,1] [,2] [,3] [,4]
[1,] 1 33.9 0 75.1
[2,] 1 33.9 0 25.8
[3,] 1 33.9 0 40.7
[4,] 1 33.9 0 97.5
[5,] 1 33.9 0 24.6
[6,] 1 33.9 1 -13.9
[7,] 1 33.9 1 86.4
[8,] 1 33.9 1 98.6
[9,] 1 33.9 1 126.9
[10,] 1 33.9 1 -67.8
[11,] 1 349.3 0 225.6
[12,] 1 349.3 0 369.5
[13,] 1 349.3 0 324.5
[14,] 1 349.3 0 353.3
Z-matrices are simple (they should be made of blocks of six 1’s assembled along the diagonal: \(\mathbf{Z_{.}}=\mathbf{I}_{100 \times 100} \otimes \mathbf{1}_{10}\)). The easiest way of producing such matrix is:
<- kronecker(diag(S), rep(1, si))
Z 1:14, 1:3] # the first 14 rows for first 3 species Z[
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 0 0
[4,] 1 0 0
[5,] 1 0 0
[6,] 1 0 0
[7,] 1 0 0
[8,] 1 0 0
[9,] 1 0 0
[10,] 1 0 0
[11,] 0 1 0
[12,] 0 1 0
[13,] 0 1 0
[14,] 0 1 0
We can now combine everything in data-generating process (directly related to the linear model):
<- c(beta0, beta1_b, beta2, beta1_w)
betas2
<- rnorm(n = S, mean = 0, sd = sqrt(sigma2_s))
u_s <- mvrnorm(n = 1, mu = rep(0, S), Sigma = C)
u_p <- rnorm(S * si, mean = 0, sd = sqrt(sigma2_e))
e <- X2 %*% betas2 + Z %*% u_s + Z %*% u_p + m + e y2
Let’s assemble the data with other variables. From now on we will use a response generated with varying within-species slopes (y2
) and both sets of weights (study specific weight_sp
and species-averaged weight_avg
).
<- data.frame(
simdata response = y2, # Average 'brain size'
weight_avg = X[, 2], # Mass / weight (g) of each species (average, mean-centred)
weight_sp = X2[, 4], # Mass / weight (g) of each species (study-level, mean-centred)
v = v, # sampling variances
sex = as.factor(X[, 3]), # Sex of sample used to estimate 'brain size' average
species = rep(species_names, each = si), # species names (each has 5 associated studies, male and female values in both)
phy = rep(species_names, each = si), # repeat of species name for phylogeny
obs = 1:length(y2) # an observation-level random effect grouping
)
levels(simdata$sex) <- c("F", "M")
head(simdata)
response weight_avg weight_sp v sex species phy obs
1 16.4 33.9 75.1 0.32295 F AK AK 1
2 15.5 33.9 25.8 0.05532 F AK AK 2
3 16.4 33.9 40.7 0.04548 F AK AK 3
4 16.7 33.9 97.5 0.06973 F AK AK 4
5 17.9 33.9 24.6 0.14968 F AK AK 5
6 17.5 33.9 -13.9 0.00938 M AK AK 6
Assume that we have collected these data from various sources from the literature (field guides, published papers). Our final data set is now ready to be analysed using ‘meta-comparative’ modelling approaches (i.e., multilevel phylogenetic meta-analyses).
The data has approximately normal distribution:
ggplot(data = simdata, aes(x = response)) +
geom_histogram(fill = "black") +
theme_classic() +
theme(text = element_text(size = 15)) +
labs(x = "Response", y = "Count")
There’s also considerable sexual dimorphism, as well as some sexual dimorphism in weight correlation:
ggplot(data = simdata, aes(x = sex, y = response)) +
geom_boxplot(fill = "black", lwd = 1) +
theme_classic() +
theme(text = element_text(size = 15)) +
labs(x = "Sex", y = "Response")
ggplot(data = simdata, aes(x = weight_sp, y = response)) +
geom_point(aes(col = species), size = 1) +
geom_smooth(aes(col = species), method = "lm", se = FALSE, lwd = 0.8) +
geom_smooth(method = "lm", colour = "black") +
theme_classic() +
theme(text = element_text(size = 25), legend.position = "none") +
# scale_discrete_manual(aesthetics = "colour", values = c("purple", "green3")) +
labs(x = "Weight", y = "Response", col = "Sex")
If we ignore phylogenetic clustering and measurement error - we could apply the simplest method (e.g. test only for linear trend or only for sex differences, treating all points as independent observations) via the lm
function:
<- lm(response ~ weight_sp, data = simdata)
model_lm1 summary(model_lm1)
Call:
lm(formula = response ~ weight_sp, data = simdata)
Residuals:
Min 1Q Median 3Q Max
-5.737 -1.423 -0.018 1.315 7.678
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.46048 0.06734 244.4 <2e-16 ***
weight_sp 0.00453 0.00025 18.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.13 on 998 degrees of freedom
Multiple R-squared: 0.247, Adjusted R-squared: 0.247
F-statistic: 328 on 1 and 998 DF, p-value: <2e-16
We get a pretty unbiased estimate of both the weight trend and the intercept, and the weight slope is significant. It obviously misses the fact that in most species the allometric slope is wildly different from the overall one. However, the power of this simple model to detect a weight relationship may be inflated (in other words we may have downwardly biased Type II error rate):
<- replicate(
powersim_lm1 n = 1000,
summary(
lm(
rnorm(S * si,
mean = model_lm1$coefficients["(Intercept)"] +
$coefficients["weight_sp"] * simdata$weight_sp,
model_lm1sd = sd(residuals(model_lm1))
~ simdata$weight_sp
)
)$coef[2, 4]
)
)
sum(powersim_lm1 < 0.05)/length(powersim_lm1) # power = 100%
[1] 1
This model also returns downwardly biased estimate of variance (after removing fixed effects) - compare it to the overall value introduced in simulation: 6.971.
var(residuals(model_lm1))
[1] 4.52
Comparative analyses typically have a single mean value for each species. In this case, a single mean for each sex within each species (given that we’re interested in sex). Sampling error is often ignored in this process. Lets conduct a PGLS
analysis with this data structure. We’ll assume a Brownian model of evolution, but other evolutionary models could be used.
Let’s adjust our simulated data:
<- simdata %>%
simdata_pgls group_by(species, sex) %>%
summarise(response = mean(response), weight = mean(weight_avg)) %>%
data.frame()
head(simdata_pgls)
species sex response weight
1 AK F 16.6 33.9
2 AK M 18.2 33.9
3 AL F 13.5 -314.9
4 AL M 14.3 -314.9
5 AM F 13.1 -172.9
6 AM M 15.4 -172.9
We now have a data set that has effectively averaged the mean brain size and mass across populations for males and females. We can now proceed to fit the PGLS
model.
<- gls(response ~ weight + sex, correlation = corBrownian(phy = mytree, form = ~species),
model_pgls data = simdata_pgls, method = "ML")
summary(model_pgls)
Generalized least squares fit by maximum likelihood
Model: response ~ weight + sex
Data: simdata_pgls
AIC BIC logLik
-6746 -6733 3377
Correlation Structure: corBrownian
Formula: ~species
Parameter estimate(s):
numeric(0)
Coefficients:
Value Std.Error t-value p-value
(Intercept) 15.94 0.488 32.7 0
weight 0.00 0.000 13.8 0
sexM 1.69 0.090 18.8 0
Correlation:
(Intr) weight
weight -0.049
sexM 0.022 -0.009
Standardized residuals:
Min Q1 Med Q3 Max
-3.812 -1.161 -0.304 0.370 3.152
Residual standard error: 1.24
Degrees of freedom: 200 total; 197 residual
Results recover most simulation parameters within good tolerances. Residual variance (after accounting for phylogenetic correlations) is 1.529 - which is much less from the assumed sigma2_s + sigma2_e
equal to 2.44.
Both in this example (pgls
model) and in the lm
example we see largely unbiased estimates of fixed effect parameters but baised variance estimates, which can be expected from data that contains additional, poorly accounted for sources of variation (e.g., phylogenetic, sampling variance).
A more realistic model can be fitted assuming that the phylogenetic signal is not 1 (as in the Brownian Motion model) but is closer to the simulated value of 0.65 used in our data generation.
<- gls(response ~ weight + sex,
model_pgls_lambda correlation = corPagel(value = 1, phy = mytree, form = ~species, fixed = FALSE),
data = simdata_pgls, method = "ML")
summary(model_pgls_lambda)
Generalized least squares fit by maximum likelihood
Model: response ~ weight + sex
Data: simdata_pgls
AIC BIC logLik
635 651 -312
Correlation Structure: corPagel
Formula: ~species
Parameter estimate(s):
lambda
0.278
Coefficients:
Value Std.Error t-value p-value
(Intercept) 15.70 0.3132 50.1 0
weight 0.00 0.0004 11.7 0
sexM 1.68 0.0915 18.4 0
Correlation:
(Intr) weight
weight -0.065
sexM 0.188 -0.021
Standardized residuals:
Min Q1 Med Q3 Max
-3.8710 -1.0238 -0.0979 0.6156 3.6153
Residual standard error: 1.15
Degrees of freedom: 200 total; 197 residual
The model estimates Pagel’s lambda to be 0.278 - arguably better agreement with our initial value (note: this value is almost certainly confounded by the fact we also have non-phylogenetic variation between species). Residual variance is in better agreement with initial assumed values: 1.331 (compared to 2.44). Bias likely stems from the random effects structure which still is not optimal (missing the non-phylogenetic within-species variance).
Maximum likelihood methods allow us to estimate additional parameters based on the tree topology and phenotypic data - which often is used to test competing models of trait evolution (e.g., a pure Brownian Motion model that assumes “diffusion-like” evolution of traits, Ornstein-Uhlenbeck model that assumes a single optimum trait value and selection “attracting” phenotypes to it, and more complex models). Let’s see how to fit such simple species-level models in R
. Note that due to limitations of phylolm
we’re ignoring sex differences (we can have 1 measurement per species).
<- simdata %>%
simdata_simple group_by(species) %>%
summarise(response = mean(response), weight = mean(weight_avg)) %>%
data.frame()
rownames(simdata_simple) <- simdata_simple$species
<- phylolm(response ~ weight,
model_phylolm_BM data = simdata_simple,
phy = mytree, model = "BM"
)<- phylolm(response ~ weight,
model_phylolm_L data = simdata_simple,
phy = mytree, model = "lambda"
)<- phylolm(response ~ weight,
model_phylolm_OU data = simdata_simple,
phy = mytree, model = "OUfixedRoot"
)
summary(model_phylolm_BM)
Call:
phylolm(formula = response ~ weight, data = simdata_simple, phy = mytree,
model = "BM")
AIC logLik
421 -207
Raw residuals:
Min 1Q Median 3Q Max
-4.276 -1.327 -0.211 0.646 4.080
Mean tip height: 8.1
Parameter estimate(s) using ML:
sigma2: 2.41
Coefficients:
Estimate StdErr t.value p.value
(Intercept) 16.687642 1.651484 10.1 <2e-16 ***
weight 0.004616 0.000355 13.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.633 Adjusted R-squared: 0.629
summary(model_phylolm_L)
Call:
phylolm(formula = response ~ weight, data = simdata_simple, phy = mytree,
model = "lambda")
AIC logLik
358 -175
Raw residuals:
Min 1Q Median 3Q Max
-4.159 -1.197 -0.113 0.701 4.059
Mean tip height: 8.1
Parameter estimate(s) using ML:
lambda : 0.384
sigma2: 0.292
Coefficients:
Estimate StdErr t.value p.value
(Intercept) 16.580403 0.390662 42.44 < 2e-16 ***
weight 0.004888 0.000514 9.51 1.4e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.48 Adjusted R-squared: 0.475
Note: p-values and R-squared are conditional on lambda=0.384.
summary(model_phylolm_OU)
Call:
phylolm(formula = response ~ weight, data = simdata_simple, phy = mytree,
model = "OUfixedRoot")
AIC logLik
366 -179
Raw residuals:
Min 1Q Median 3Q Max
-4.088 -1.135 -0.025 0.820 4.248
Mean tip height: 8.1
Parameter estimate(s) using ML:
alpha: 1.59
sigma2: 7.64
Coefficients:
Estimate StdErr t.value p.value
(Intercept) 16.500678 0.181748 90.79 < 2e-16 ***
weight 0.004657 0.000474 9.83 2.9e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.496 Adjusted R-squared: 0.491
Note: p-values and R-squared are conditional on alpha=1.59.
Brownian Motion (BM) and Pagel’s lambda (lambda) models reproduce similar results as above models using GLS. The Ornstein-Uhlenbeck (OU) model struggles to robustly estimate the “attraction” parameter alpha
- which stems from the fact that there’s no clear selective optimum (note also a wildly upwardly biased estimate of the “rate” parameter (sigma2
) in this model). Judging from the AIC-based comparison of models, the lambda model is the best one (rightly so as it is the closest representation to how the data was generated). Note, however, that OU model - although inappropriate - still has better support than the BM model Tab. 2. It commonly happens that the OU model is preferred over the BM model, one of the reasons being is the additional parameter being estimated in the BM model. Use AIC comparisons with caution.
<- data.frame(Model = c("Brownian Motion",
table "Pagel's lambda",
"Ornstein-Uhlenbeck"),
AIC = c(model_phylolm_BM$aic,
$aic,
model_phylolm_L$aic))
model_phylolm_OU<- table %>%
table mutate(DeltaAIC = AIC - min(AIC)) %>%
arrange(DeltaAIC)
flextable(table) %>% autofit()
Model | AIC | DeltaAIC |
---|---|---|
Pagel's lambda | 358 | 0.00 |
Ornstein-Uhlenbeck | 366 | 7.92 |
Brownian Motion | 421 | 62.46 |
We have a number of available packages at our disposal that we could use to analyse these comparative data. We demonstrate using a few common and flexible packages: metafor
, brms
(using rstan
and STAN
) and MCMCglmm
. metafor
uses a ‘frequentist’ framework whereas brms
and MCMCglmm
are Bayesian. Given that we know the data-generating model, we’ll first show readers what this model looks like in these various packages. We’ll then demonstrate, using a single package, how the multilevel approach varies from the typical ‘species average’ analyses – analyses that ignore population-level variation within species and sampling error.
metafor
We can fit a multilevel comparative model using the metafor
package. metafor
takes a phylogentic correlation matrix, so, from our phylogentic tree we will need to first create this matrix. Readers would have noticed that this was already done in earlier sections (see Section 4.2), but we’ll show this step again.
# Create the phylogenetic correlation matrix from our phylogenetic tree
<- vcv.phylo(mytree, corr = TRUE) A
We can now use our data to fit the multilevel model:
# Fit the metafor model
<- rma.mv(response ~ weight_sp + sex, V = v,
model_metafor random = list(~1|species,
~1|phy,
~1|obs),
R = list(phy = A), data = simdata)
model_metafor
Multivariate Meta-Analysis Model (k = 1000; method: REML)
Variance Components:
estim sqrt nlvls fixed factor R
sigma^2.1 1.8546 1.3618 100 no species no
sigma^2.2 1.0262 1.0130 100 no phy yes
sigma^2.3 1.3987 1.1827 1000 no obs no
Test for Residual Heterogeneity:
QE(df = 997) = 60013.0865, p-val < .0001
Test of Moderators (coefficients 2:3):
QM(df = 2) = 480.1962, p-val < .0001
Model Results:
estimate se zval pval ci.lb ci.ub
intrcpt 15.8422 0.4189 37.8221 <.0001 15.0212 16.6631 ***
weight_sp 0.0021 0.0005 4.3836 <.0001 0.0012 0.0030 ***
sexM 1.6582 0.0775 21.3850 <.0001 1.5062 1.8101 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this model makes full use of available multi-level information (using both phylogeny and non-phylogenetic between-species effects via the species
random term). In metafor, we have to explicitly model the within-species variance (obs
) recorded on top of the sampling variance (v
) associated with the estimated effect sizes (response
).
Several points are clear: - The intercept is estimated without the upward bias seen in earlier models; - The allometric slope (between-species trend in brain ~ weight
) is still downwardly biased (we still have not accounted for within-species slopes, this wopuld be possible with a random-regression component and within-subject centering); - sex differences are estimated precisely; - phenotypic heritability is estimated to be 0.24 - which is downwardly biased compared to the simulated value of 0.65.
MCMCglmm
The logic of fitting a Bayesian model in MCMCglmm
is similar. However, we have to remember about the additional complication of having to specify priors. Below we use one of the most commonly applied prior specifications (“parameter expanded” priors which are relatively uninformative across a wide range of variance values).
<- list(
prior R = list(V = 1, nu = 0.002),
G = list(
G1 = list(V = 1, nu = 0.002, alpha.mu = 0, alpha.V = 1e4),
G2 = list(V = 1, nu = 0.002, alpha.mu = 0, alpha.V = 1e4)
) )
We laso have to generate the inverse of the phylogenetic A
matrix:
<- inverseA(mytree, nodes = "ALL")$Ainv IA
Now we can fit the model:
<- MCMCglmm(response ~ sex + weight_sp,
model_mcmcglmm random = ~ species + phy,
mev = simdata$v,
data = simdata,
ginverse = list(phy = IA),
prior = prior,
verbose = FALSE,
nitt = 50000, thin = 40, burnin = 10000
)
summary(model_mcmcglmm)
Iterations = 10001:49961
Thinning interval = 40
Sample size = 1000
DIC: 3332
G-structure: ~species
post.mean l-95% CI u-95% CI eff.samp
species 1.93 1.19 2.81 825
~phy
post.mean l-95% CI u-95% CI eff.samp
phy 1.24 0.316 2.49 626
R-structure: ~units
post.mean l-95% CI u-95% CI eff.samp
units 1.4 1.27 1.54 1000
Location effects: response ~ sex + weight_sp
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 15.864588 15.053685 16.795257 888 <0.001 ***
sexM 1.660577 1.502494 1.807368 1165 <0.001 ***
weight_sp 0.002005 0.000861 0.003190 833 <0.001 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Estimated parameters are close to what we’ve seen in metafor
and in most cases they are in good agreement with the true values. The chain control parameters (nitt
, burnin
and thin
) are set to some reasonable values - but their adequacy should always be checked by inspecting trace plots of estimated parameters - e.g., for phylogenetic effect Fig. 2 & Fig. 3:
plot(model_mcmcglmm$VCV[, "phy"])
autocorr.plot(model_mcmcglmm$VCV[, "phy"])
Autocorrelation in the trace plot for the phylogenetic variance (Fig. 3) is low and does not indicate any mixing problems in the MCMC chain.
brms
brms
is a powerful package that interfaces with Stan
to fit multilevel models. It’s syntax is a little different but probably more familiar to the style used in popular mixed modelling packages, such as lme4
.
brms
requires a covariance matrix as opposed to a correlation matrix, so we need to adjust the \(\matrix{A}\) matrix:
# Create the phylogenetic correlation matrix from our phylogenetic tree.
<- vcv.phylo(mytree, corr = FALSE) Acov
Prior specification is important in Bayesian analyses, but for demonstration purposes we’ll use the default priors on parameter estimates.
brms
uses flat uniform priors for fixed effects and a Student-t distribution for sd and intercepts. One can use brms::prior_summary(model_brms)
to explore what priors are used.
<- brms::brm(
model_brms | se(sqrt(v)) ~ weight_sp + sex +
response 1 | gr(phy, cov = Acov)) +
(1 | species) +
(1 | obs),
(data = simdata, data2 = list(Acov = Acov),
chains = 2, cores = 2,
warmup = 1000, iter = 3000,
control = list(adapt_delta = 0.98), backend = "cmdstanr"
)
The model takes a little bit of time to estimate parameters, but it gives us roughly the same results as metafor
:
summary(model_brms)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: response | se(sqrt(v)) ~ weight_sp + sex + (1 | gr(phy, cov = Acov)) + (1 | species) + (1 | obs)
Data: simdata (Number of observations: 1000)
Draws: 2 chains, each with iter = 3000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Group-Level Effects:
~obs (Number of levels: 1000)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.22 0.03 1.16 1.29 1.00 580 1348
~phy (Number of levels: 100)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.44 0.12 0.22 0.69 1.01 236 356
~species (Number of levels: 100)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 1.04 0.16 0.76 1.38 1.01 217 534
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 14.75 0.64 13.46 16.03 1.00 440 712
weight_sp 0.00 0.00 0.00 0.00 1.01 369 848
sexM 1.59 0.08 1.43 1.76 1.01 298 594
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.00 0.00 0.00 0.00 NA NA NA
Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Again, parameter estimates are similar to what we had in other comparative-meta approaches. We should keep in mind that our data comes from a simulation that - even if using reasonable parameter values - is still a simplification of the real world, and may cause unexpected behaviours of the models. The point of our simulations is to show the diversity of available methods - if used as a goal on their own, we would also have to explore sensitivity of our models to ranges of assumed parameter values.
An important point to note is also the singular nature of our simulation. Due to the stochastic nature of the simulation - it is just one realisation of the random process generating the data. Uncertainty in such case is inherent, especially in case of (co)variance parameters. More thorough exploration of the observed patterns would require repeating the sampling process and rerunning analyses to generate the distributions of parameters of interest, and properly estimate their sampling uncertainty.
In terms of model diagnostics, performing them for a STAN
/brms
model is similar to other MCMC-based methods. It is beyond the scope of our paper - but the basics boil down to inspecting trace plots of MCMC samples, optimizing iteration numbers and other chain parameters to achieve optimal effective sample size (ESS), and checking the sensitivity of the model to default prior specifications.
plot(model_brms)
brms
modelbrms
modelSTAN
The limitation of most linear modelling packages stems from the fact that they do not allow a lot of flexibility in defining non-standard random effects structures. Many complex models (e.g., double-hierarchical mixed models, models with custom covariance structures) are not manageable via the functions we’ve covered so far. To fit such models we need more flexible approaches such as WinBUGS
, JAGS
or STAN
. STAN
is particularly powerful due to its use of Hamiltonian MCMC simulations. The brms
package uses STAN
under the hood but moving to native STAN
code allows us to have more control over the model specification.
Below examples show example specifications of two meta-analytic comparative models. The first one is a simple BM model with additional non-phylogenetic sources of variation generated by random noise and non-phylogenetic taxonomic variation. It is analogous to models fitted above using higher-level approaches. To fit this model we first need to specify a model file (usually a separate text file we place in the working directory):
// save below chunk as model_mixed_meta_phylo.stan in your working directory
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
int<lower=0> Nran1; // number of species (non-phylo effect)
int<lower=0> Nran2; // number of observations (obs-level effect)
int<lower=0> Nran3; // number of species (phylo effect)
matrix[N, K] x; // predictor matrix
array[N] real y; // outcome vector
array[N] int<lower=1, upper=Nran1> levran1; // random effect 1 indices
array[N] int<lower=1, upper=Nran2> levran2; // random effect 2 indices
array[N] int<lower=1, upper=Nran2> levran3; // random effect 3 indices
real<lower=0> sdscal; // scaling factor for SD priors
vector[N] SE; // sampling errors of effect sizes
matrix[Nran3, Nran3] A; // phylogenetic correlation matrix
vector[Nran3] zeroes; // vector of zeroes for MVN sampling
}
parameters {
real<lower=0> tau; // error scale (SD)
real<lower=0> sigmaran1; // sd of random effect 1
real<lower=0> sigmaran2; // sd of random effect 2
real<lower=0> sigmaran3; // sd of random effect 3 (phylogeny)
vector[K] beta; // coefficients for predictors
vector[Nran1] u1; // random effect 1 coefficients
vector[Nran2] u2; // random effect 2 coefficients
vector[Nran3] u3; // random effect 3 coefficients (phylogeny)
// real zeta;
vector[N] zeta_n; // within-study linear predictor
}
model {
matrix[Nran3, Nran3] SIGMA = A * (sigmaran3)^2; // VCV matrix of phylogenetic effect
target += cauchy_lpdf(tau | 0, 2.5*sdscal); // prior for tau
target += cauchy_lpdf(sigmaran1 | 0, 2.5*sdscal); // prior for sigmaran1
target += normal_lpdf(u1 | 0, sigmaran1); // vector of rand coefs for effect 1
target += cauchy_lpdf(sigmaran2 | 0, 2.5*sdscal); // prior for sigmaran2
target += normal_lpdf(u2 | 0, sigmaran2); // vector of rand coefs for effect 2
target += cauchy_lpdf(sigmaran3 | 0, 2.5*sdscal); // prior for sigmaran3
target += multi_normal_lpdf(u3 | zeroes, SIGMA); // vector of rand coefs for effect 3 (phylo)
// within-study linear predictor
target += normal_lpdf(zeta_n | x * beta + u1[levran1] + u2[levran2] + u3[levran3], tau);
// adding sampling variance
target += normal_lpdf(y | zeta_n, SE);
}
The proposed model specification is directly based on the hierarchiucal structure that was used to simulate the data. However, the model can also be made much more efficient (important for larger phylogenies) by using Cholesky factorization of the phylogenetic VCV matrix, and non-central re-parametrisation of SD parameters (more details in the STAN User’s Guide).
To compile the model and sample the posteriors we will employ the CMDSTANR
backend - it is much faster and more robust than the more commonly used rstan
. The below code will enable running the model and generating draws from the posterior, together with summarising the output. Running of STAN
code depends strongly on local copmputer architecture - the block of code is disabled and has to be run manually. Keep in mind, that we propose minimal parameters for the MCMC chain - in practice we would need to run longer chains and check for convergence and mixing of the chain. Sensitivity to different priors should also be checked.
# define data as list of named variables
# matching STAN model definition
<- list(
mydata N = nrow(simdata),
K = 3,
Nran1 = length(unique(simdata$species)),
levran1 = as.numeric(as.factor(simdata$species)),
Nran3 = length(unique(simdata$species)),
levran3 = as.numeric(as.factor(simdata$species)),
Nran2 = length(unique(simdata$obs)),
levran2 = as.numeric(as.factor(simdata$obs)),
y = simdata$response,
x = model.matrix(~ weight_sp + sex, data = simdata),
sdscal = 1.7,
SE = sqrt(simdata$v),
A = A,
zeroes = rep(0, length(unique(simdata$species)))
)
# compile the model
<- cmdstan_model("./R/model_mixed_meta_phylo.stan")
model1
# sample from the posterior
<- model1$sample(data = mydata, chains = 1, refresh = 500)
fit1
# explore the summary
$summary() fit1
Now we can use the full power of raw STAN
to fit an alternative model - a meta-analytic comparative analysis assuming an Ornstein-Uhlenbeck (OU) process. This process assumes that, instead of diffusing throughout phenotypic space freely with some rate, the trait evolves in a constrained way “pulled” to some optimum. Phylogenetic covariance generated by such a process can take the following form: \(K_{ij}=\eta^2exp(-\rho^2D_{ij})\). \(D_{ij}\) describes the evolutionary distance between species \(i\) and \(j\), and \(\rho\) is related to the strength “pulling” the trait back to some optimum.
To fit such a model we have to modify the previous specification by including a custom Gaussian process kernel function that will define the desired VCVstructure between species. Parameters of such a function can than be sampled together with other coefficients of the model. Fitting such model in, e.g., brms
would not be possible (brms
would not be able to sample all parameters, most importantly - the OU process parameters). An approximate approach would require refitting the above brms
model with the phylogenetic tree rescaled to reflect the OU model, under several realistic values of the OU process parameters.
// save this as a separate *.stan file in your working directory
functions{
matrix cov_OU(matrix x, real sq_alpha, real sq_rho, real delta) {
int N = dims(x)[1];
matrix[N, N] K;
for (i in 1:(N-1)) {
K[i, i] = sq_alpha + delta;for (j in (i + 1):N) {
K[i, j] = sq_alpha * exp(-sq_rho * x[i,j] );
K[j, i] = K[i, j];
}
}
K[N, N] = sq_alpha + delta;return K;
}
}
data {
int<lower=0> N;
int<lower=0> K;
int<lower=0> Nran1;
int<lower=0> Nran2;
int<lower=0> Nran3;
matrix[N, K] x;
array[N] real y;
array[N] int<lower=1, upper=Nran1> levran1;
array[N] int<lower=1, upper=Nran2> levran2;
array[N] int<lower=1, upper=Nran2> levran3;
real<lower=0> sdscal;
vector[N] SE;
matrix[Nran3, Nran3] D;
vector[Nran3] zeroes;
}
parameters {
real<lower=0> tau;
real<lower=0> sigmaran1;
real<lower=0> sigmaran2;
real<lower=0> sigmaran3;
vector[K] beta;
vector[Nran1] u1;
vector[Nran2] u2;
vector[Nran3] u3;
vector[N] zeta_n;
real<lower=0> etasq;
real<lower=0> rhosq;
}
model {
matrix[Nran3, Nran3] SIGMA = cov_OU(D, etasq, rhosq, 0.01);
target += normal_lpdf(rhosq | 3, 0.25);
target += normal_lpdf(etasq | 1, 0.25);
target += cauchy_lpdf(tau | 0, 2.5*sdscal);
target += cauchy_lpdf(sigmaran1 | 0, 2.5*sdscal);
target += normal_lpdf(u1 | 0, sigmaran1);
target += cauchy_lpdf(sigmaran2 | 0, 2.5*sdscal);
target += normal_lpdf(u2 | 0, sigmaran2);
target += cauchy_lpdf(sigmaran3 | 0, 2.5*sdscal);
target += multi_normal_lpdf(u3 | zeroes, SIGMA);
target += normal_lpdf(zeta_n | x * beta + u1[levran1] + u2[levran2] + u3[levran3], tau);
target += normal_lpdf(y | zeta_n, SE);
}
We can now compile the model and sample the posterior:
<- cophenetic(mytree)
D <- D/max(D)
D <- list(
mydata N = nrow(simdata),
K = 3,
Nran1 = length(unique(simdata$species)),
levran1 = as.numeric(as.factor(simdata$species)),
Nran3 = length(unique(simdata$species)),
levran3 = as.numeric(as.factor(simdata$species)),
Nran2 = length(unique(simdata$obs)),
levran2 = as.numeric(as.factor(simdata$obs)),
y = simdata$response,
x = model.matrix(~ weight_sp + sex, data = simdata),
sdscal = 1.7,
SE = sqrt(simdata$v),
D = D,
zeroes = rep(0, length(unique(simdata$species)))
)
<- cmdstan_model("./R/model_mixed_meta_phyloou.stan")
model2 <- model2$sample(data = mydata, chains = 1, refresh = 500)
fit2
$summary() fit2
In order to run the community-level example we have to prepare an expanded version of our data - one that has multiple realisations of the measured relationship across a spatial grid of cells/communities.
We will assume a grid of 100 cells (10 by 10 spatial coordinates). The relationship of interest will be the same as previously (brain~mass
relationship - although some environmental predictor variable would probably be more realistic). We will vary this relationship across the grid by simply drawing the underlying slope parameter from a predefined distribution. To create spatial dependency we will simulate the slopes to be the strongest close to the middle of the grid, and dissipate in magnitude towards the edges.
# recaping initial parameter values and data generation
# first we wrap it into a function
# NOTE: it's not very elegant to expect function
# to see values in global environment
# but it's a quick and dirty solution for now
<- function(beta1_b_tweak) {
data_sim_grid <- rnorm(S, mean = mu_weight, sd = sqrt(sigma2_bs))
species_weights <- cbind(
X rep(1, S * si),
rep(species_weights, each = si),
rep(c(rep(0, si / 2), rep(1, si / 2)), times = S)
)<- cbind(X, rep(species_weights, each = si) +
X2 rnorm(S * si, mean = 0, sd = sqrt(sigma2_ws)))
<- c(beta0, beta1_b_tweak, beta2, beta1_w)
betas2
<- rnorm(n = S, mean = 0, sd = sqrt(sigma2_s))
u_s <- mvrnorm(n = 1, mu = rep(0, S), Sigma = C)
u_p <- rnorm(S * si, mean = 0, sd = sqrt(sigma2_e))
e <- X2 %*% betas2 + Z %*% u_s + Z %*% u_p + m + e
y2
<- data.frame(
simdata response = y2, # Average 'brain size'
weight_avg = X[, 2], # Mass / weight (g) of each species (average, mean-centred)
weight_sp = X2[, 4], # Mass / weight (g) of each species (study-level, mean-centred)
v = v, # sampling variances
sex = as.factor(X[, 3]), # Sex of sample used to estimate 'brain size' average
species = rep(species_names, each = si), # species names (each has 5 associated studies, male and female values in both)
phy = rep(species_names, each = si), # repeat of species name for phylogeny
obs = 1:length(y2) # an observation-level random effect grouping
)
levels(simdata$sex) <- c("F", "M")
return(simdata)
}
# construct the grid of coordinates
# with auxiliary variable that allows for varying slopes
# across the grid (decaying towards edges)
# then we simulate the slopes themselves
<- expand.grid(x = 1:10, y = 1:10)
mygrid $dist_center <- sqrt((mygrid$x - 5.5)^2 + (mygrid$y - 5.5)^2)
mygrid$beta1_b_tweak <- mygrid$dist_center / 1000 +
mygridrnorm(100, mean = 0, sd = 0.0005)
# simulate list of datasets for each cell
<- map(mygrid$beta1_b_tweak, data_sim_grid) simdata_list
Frist - we analyze each grid cell separately using a conventional phylogenetic meta-analysis.
# uncomment .progress to see a progress bar
# note - it will take some time to run
<- map(
grid_results
simdata_list,function(x) {
rma.mv(response ~ weight_sp + sex,
V = v,
random = list(
~ 1 | species,
~ 1 | phy,
~ 1 | obs
),R = list(phy = A), data = x
)
}# ,.progress = TRUE
)
Reshape the results to extract the slope of interest, it’s SE, and to merge it with spatial data:
$b1 <- map_vec(grid_results, function(x) coef(x)["weight_sp"])
mygrid$b1_se <- map_vec(grid_results, function(x) x$se[2])
mygrid$grid_id <- paste0(mygrid$x, "_", mygrid$y) mygrid
The resulting data has a clear spatially non-uniform distributon (see Fig. 6).
%>% ggplot(aes(x = x, y = y, fill = b1)) +
mygrid geom_tile() +
scale_fill_viridis_c() +
theme_classic() +
labs(
x = "X coordinate",
y = "Y coordinate",
fill = "Slope"
+
) theme(
text = element_text(size = 16),
aspect.ratio = 1
)
Finally, fit community model assuming the most basic form of autocorrelation function (exponential decay). To do this we first create a distance matrix between all grid cell pairs, and augment the dataset with a dummy const
variable.
<- outer(
distance X = 1:100, Y = 1:100,
function(x, y, grid) {
sqrt((grid[x, "x"] - grid[y, "x"])^2 +
"y"] - grid[y, "y"])^2)
(grid[x, grid = mygrid
},
)
dimnames(distance) <- list(mygrid$grid_id, mygrid$grid_id)
$const <- 1
mygrid<- mygrid %>% select(grid_id, b1, b1_se, const) mygrid
Then we fit the metafor
model.
<- rma.mv(yi = b1, V = b1_se^2,
model_grid_metafor random = list(~ grid_id | const),
struct = "SPEXP",
dist = list(distance),
test = "t",
sparse = TRUE,
control = list(optimizer = "optim", optmethod="Nelder-Mead"),
data = mygrid
)
summary(model_grid_metafor)
Multivariate Meta-Analysis Model (k = 100; method: REML)
logLik Deviance AIC BIC AICc
577.9344 -1155.8687 -1149.8687 -1142.0834 -1149.6161
Variance Components:
outer factor: const (nlvls = 1)
inner term: ~grid_id (nlvls = 100)
estim sqrt fixed
tau^2 0.0000 0.0020 no
rho 16.7852 no
Test for Heterogeneity:
Q(df = 99) = 780.2858, p-val < .0001
Model Results:
estimate se tval df pval ci.lb ci.ub
0.0024 0.0016 1.5191 99 0.1319 -0.0007 0.0056
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ML/REML methods | Bayesian methods | |
---|---|---|
Example packages & functions | - asreml::asreml() (paid) - metafor::rma.mv() - glmmTMB::glmmTMB |
- MCMCglmm::MCMCglmm() - brms::brms() - STAN (external, connects to rstan or cmdstan ) |
Advantages | - Fast - Easy to implement |
- Good with non-gaussian traist (incl. multivariate models) - Easy derivation of CIs - Ease of deriving posteriors of variance functions (e.g., heritability) - Flexibility in model specification |
Disadvantages | - Difficult to estimate SE of variance functions (use delta method) - Convergence problems in case of difficult data - Only approximate solutions for non-gaussian traits |
Slower (often days/weeks of runtime) - Require (sometimes careful) priors |
Testing paradigm | - Wald-type tests for fixed effects - Likelihood-ratio tests for random |
- Interval estimation |
Some packages commonly used in comparative analyses do not take advantage of a full mixed-modeling framework and will not be able to implement the full potential of a meta-comparative approach we propose (e.g., GLS libraries: phylopath::phylopath()
, phylolm::phylolm()
, ape::gls()
; models using reversible-jump Markov Chain approaches to categorical data BayesTraits; MuSSE/BiSSE; BEAST, phytools::fastAnc()
).