Fast Estimation of Random Effect Models


Introduction

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]

Introduction Example

We have clustered or otherwise dependent data.

Twins, individuals treated at the same hospital, repeated measurements, etc.

Interested in modeling the dependence per se.

Focus on clustered data.

Examples

Family data in epidemiology

using the Swedish multi-generation register.

In sociology and econometrics.

Random intercept and slope models for neighborhood effects, school effects, etc.

Modeling trajectories of biomarkers of kidney function

accounting for diabetic complications and medication. Using data from the Swedish SCREAM project.

Introduction Example (cont.)

Outcomes \(\vec Y_i = (Y_{i1}, \dots, Y_{in_i})^\top\), model parameters \(\vec\theta\), and random effects \(\vec U_i\) for cluster \(i\).

\(\vec Y_i\) given \(\vec U_i = \vec u\) has density \(h_i(\vec y;\vec u, \vec\theta)\) and \(\vec U_i\) has density \(g_i(\vec u; \vec\theta)\).

Introduction Example (cont.)

The log marginal likelihood term is

\[ l_i(\vec\theta) = \log\int h_i(\vec Y_i;\vec u, \vec\theta)g_i(\vec u; \vec\theta)\der\vec u. \]

Typically intractable.

No closed-form solution.
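
To make the intractability concrete, here is a minimal sketch (not part of the methods in this talk) that approximates a single \(l_i\) by plain Monte Carlo for a random-intercept logistic model. The data and parameter values are made up for illustration.

```r
# made-up data and parameters for one cluster
set.seed(1)
y <- c(1L, 0L, 1L, 1L)                 # outcomes Y_i
x <- cbind(1, rnorm(length(y)))        # design matrix
beta <- c(-1, .5)                      # fixed effects
sigma_u <- .75                         # random-intercept standard deviation

# draw from g_i and average h_i(Y_i; u, theta) over the draws
n_sim <- 1e5
u <- rnorm(n_sim, sd = sigma_u)
log_h <- vapply(u, function(ui){
  eta <- drop(x %*% beta) + ui
  sum(dbinom(y, 1L, plogis(eta), log = TRUE))
}, numeric(1))

# log of the Monte Carlo average, computed stably
max(log_h) + log(mean(exp(log_h - max(log_h))))
```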

Issue: current approximations either have a long estimation time or a large bias.

Overview

Use variational approximations to approximate the log marginal likelihood terms, \(l_i\)’s.

Use structure of the problem to perform fast estimation.

We have made an R package for this.

Variational Approximations

Select some variational distribution with density \(q_i(\vec u;\vec\omega_i)\) for \(\vec\omega_i \in \Omega_i\) and use

\[\begin{align*} l_i(\vec\theta) &\geq \int q_i(\vec u; \vec\omega_i) \log\left(\frac{h_i(\vec Y_i;\vec u, \vec\theta)g_i(\vec u; \vec\theta)} {q_i(\vec u; \vec\omega_i)}\right) \der\vec u \\ &= \tilde l_i(\vec\theta, \vec\omega_i). \end{align*}\]

The point: replace \(l_i\) with the more tractable lower bound \(\tilde l_i\).
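
The bound follows from Jensen's inequality, and the gap is the Kullback-Leibler divergence from \(q_i(\cdot;\vec\omega_i)\) to the conditional density of \(\vec U_i\) given \(\vec Y_i\),

\[ l_i(\vec\theta) - \tilde l_i(\vec\theta, \vec\omega_i) = \int q_i(\vec u;\vec\omega_i) \log\left(\frac{q_i(\vec u;\vec\omega_i)} {f_i(\vec u\mid \vec Y_i;\vec\theta)}\right)\der\vec u \geq 0, \]

writing \(f_i(\vec u\mid \vec Y_i;\vec\theta) = h_i(\vec Y_i;\vec u,\vec\theta)g_i(\vec u;\vec\theta)\exp(-l_i(\vec\theta))\) for the conditional density. The bound is therefore tight when \(q_i\) matches the conditional density of \(\vec U_i\) given \(\vec Y_i\).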

Approximate Maximum Likelihood

\[ \argmax_{\vec\theta}\,\sum_{i = 1}^m l_i(\vec\theta) \approx \argmax_{\vec\theta} \max_{\vec\omega_1, \dots, \vec\omega_m} \sum_{i = 1}^m \tilde l_i(\vec\theta, \vec\omega_i). \]

Parameters: \(\text{dim}(\vec\theta) + \sum_{i = 1}^m \text{dim}(\vec\omega_i) \gg \text{dim}(\vec\theta)\)!

… but the Hessian of the objective is sparse (the problem is partially separable).
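
To see the sparsity, here is a minimal R sketch of the block structure (all dimensions and values are made up for illustration): the \(\vec\theta\) block couples with every \(\vec\omega_i\) block, but the \(\vec\omega_i\) blocks do not couple with each other.

```r
# "block arrow" sparsity pattern of the Hessian of the lower bound
library(Matrix)
p <- 3L  # dim(theta)
s <- 4L  # dim(omega_i)
m <- 5L  # number of clusters

blocks <- replicate(m, Matrix(1, s, s), simplify = FALSE)
H <- bdiag(Matrix(1, p, p), bdiag(blocks))  # theta block + omega_i blocks
H[seq_len(p), -seq_len(p)] <- 1             # theta-omega_i couplings
H[-seq_len(p), seq_len(p)] <- 1
image(H)                                    # shows the block-arrow pattern
```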

Optimization Method

Quick Remarks

We briefly cover the details. It is a bit technical (bear with me!).

See Nocedal and Wright (2006) and the package mentioned later for details.

Issues with Quasi-Newton Methods

Have to optimize

\[ \tilde l(\vec \theta, \vec\omega_1,\dots,\vec\omega_m) = \sum_{i = 1}^m \tilde l_i(\vec\theta, \vec\omega_i). \]

Quasi-Newton methods create an approximation of the Hessian of \(\tilde l\).

E.g. BFGS.

… but these approximations do not preserve the sparsity.

Partial Separability

Make \(m\) BFGS updates to approximate the Hessians of \(\tilde l_1(\vec\theta, \vec\omega_1), \dots, \tilde l_m(\vec\theta, \vec\omega_m)\).

Solve the Newton system using the conjugate gradient method.

This is fast as we preserve the sparsity.

It may also yield a better Hessian approximation.
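
A minimal sketch of the idea (not the psqn implementation): keep one Hessian approximation per element function and solve the Newton system with the conjugate gradient method, accumulating the matrix-vector products element by element so the full Hessian is never formed. The matrices below are random stand-ins for the per-element BFGS approximations.

```r
# made-up dimensions and per-element Hessian approximations
p <- 3L; s <- 4L; m <- 5L
set.seed(2)
B_list <- replicate(m, crossprod(matrix(rnorm((p + s)^2), p + s)),
                    simplify = FALSE)

# matrix-vector product with the implied (p + m s) x (p + m s) Hessian
Bx <- function(x){
  out <- numeric(length(x))
  for(i in seq_len(m)){
    idx <- c(seq_len(p), p + (i - 1L) * s + seq_len(s))
    out[idx] <- out[idx] + drop(B_list[[i]] %*% x[idx])
  }
  out
}

# plain conjugate gradient for B x = g using only Bx()
cg_solve <- function(g, n_it = 25L, tol = 1e-8){
  x <- numeric(length(g)); r <- g; d <- r
  for(k in seq_len(n_it)){
    Bd <- Bx(d)
    alpha <- sum(r^2) / sum(d * Bd)
    x <- x + alpha * d
    r_new <- r - alpha * Bd
    if(sqrt(sum(r_new^2)) < tol) break
    d <- r_new + (sum(r_new^2) / sum(r^2)) * d
    r <- r_new
  }
  x
}

g <- rnorm(p + m * s)
step <- cg_solve(g)  # quasi-Newton direction (up to a sign)
```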

Comparing with BFGS

Suppose \(\text{dim}(\vec\theta) = p\), \(\text{dim}(\vec\omega_i) = s\), and at most \(n_{cg}\) conjugate gradient iterations.

The ratio of flops between BFGS and the new methods is (roughly)

\[ \frac{p^2 + 2mps + (ms)^2} {n_{cg}(p^2 + 2pms + ms^2) + mp^2} \approx \frac m{n_{cg}} \]

for large \(m\) and \(s\) slightly larger than \(p\). This is favorable if \(n_{cg} \ll m\).
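
A quick back-of-the-envelope check of the ratio with made-up values of \(p\), \(s\), \(m\), and \(n_{cg}\):

```r
# the exact ratio depends on the constants, but it grows roughly in
# proportion to m / n_cg
p <- 5L; s <- 6L; m <- 1000L; n_cg <- 10L
(p^2 + 2 * m * p * s + (m * s)^2) /
  (n_cg * (p^2 + 2 * p * m * s + m * s^2) + m * p^2)
```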

The psqn Package

Implemented in the psqn package.

Both R and C++ interface.

Supports computation in parallel

using OpenMP for C++ implementations.

Applicable to other partially separable problems!

E.g. the inner optimization in the Laplace approximation.

Example

GLMM Example

Take a GLMM with a logit link function.

Use \(q_i(\vec u;\vec\lambda,\mat \Lambda) = \phi^{(K)}(\vec u; \vec \lambda,\mat\Lambda)\) (a GVA); see the sketch of the lower bound below.

Gaussian variational approximations (Ormerod and Wand 2012).

Compare with Laplace approximation

from lme4 (Bates et al. 2015).

Compare with L-BFGS

from the lbfgs package (Coppola, Stewart, and Okazaki 2020).
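
As a sketch, assuming \(\vec U_i \sim N^{(K)}(\vec 0, \mat\Sigma)\) and writing \(\vec\omega_i = (\vec\lambda_i, \mat\Lambda_i)\), the GVA lower bound for the Bernoulli-logit model is

\[\begin{align*} \tilde l_i(\vec\theta, \vec\lambda_i, \mat\Lambda_i) ={}& \sum_{j = 1}^{n_i}\left(Y_{ij}\vec x_{ij}^\top(\vec\beta + \vec\lambda_i) - \Expec_{q_i}\log\left(1 + \exp\left((\vec\beta + \vec U_i)^\top\vec x_{ij}\right)\right)\right) \\ &+ \frac 12\left(\log\lvert\mat\Lambda_i\rvert - \log\lvert\mat\Sigma\rvert - \text{tr}\left(\mat\Sigma^{-1}\mat\Lambda_i\right) - \vec\lambda_i^\top\mat\Sigma^{-1}\vec\lambda_i + K\right). \end{align*}\]

The remaining expectations are one-dimensional Gaussian integrals since \((\vec\beta + \vec U_i)^\top\vec x_{ij}\) is univariate Gaussian under \(q_i\), so they can, e.g., be handled with Gauss-Hermite quadrature.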

GLMM Example (cont.)

Random intercept and slope model.

E.g. schools or neighborhoods.

Three fixed effects, three random effects per cluster, \(m = 1000\) clusters, and \(n_i = 10\) observations in each cluster.

GLMM Example (cont.)

\[\begin{align*} Y_{ij} &\sim \text{Bin}\left(1, (1 + \exp(-\eta_{ij}))^{-1}\right) \\ \eta_{ij} &= (\vec\beta + \vec U_i)^\top\vec x_{ij} \\ \vec x_{ij} &= (1, Z_{ij1}, Z_{ij2}) \\ Z_{ij1}, Z_{ij2} &\sim N(0, 1) \\ \vec\beta &= (-2, 2^{-1/2}, 2^{-1/2}) \\ \vec U_i &\sim N^{(3)}(\vec 0_3, 3^{-1}\mat I_3) \end{align*}\]
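
A minimal sketch, not the simulation code behind the results that follow, of how one could draw a single data set from this model:

```r
# simulate the random intercept and slope logistic model above
set.seed(3)
m <- 1000L; n_i <- 10L
beta <- c(-2, 1 / sqrt(2), 1 / sqrt(2))
Sigma <- diag(3) / 3

dat <- do.call(rbind, lapply(seq_len(m), function(i){
  U <- drop(crossprod(chol(Sigma), rnorm(3)))  # U_i ~ N(0, Sigma)
  Z <- matrix(rnorm(2 * n_i), n_i)             # Z_ij1 and Z_ij2
  X <- cbind(1, Z)                             # x_ij = (1, Z_ij1, Z_ij2)
  eta <- drop(X %*% (beta + U))
  data.frame(cluster = i, y = rbinom(n_i, 1L, plogis(eta)),
             X1 = Z[, 1], X2 = Z[, 2])
}))
head(dat)
```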

Findings

Variational approximations are faster.

They yield a lower bias in this case.

The new optimization method is much faster than generic alternatives.

Bias: Fixed Effect Coefficients

| Method    | Inter  | SE      | X1       | SE      | X2       | SE      |
|-----------|--------|---------|----------|---------|----------|---------|
| Laplace   | 0.1796 | 0.00423 | -0.05136 | 0.00383 | -0.05472 | 0.00363 |
| GVA       | 0.0255 | 0.00465 | -0.00603 | 0.00369 | -0.00851 | 0.00387 |
| GVA LBFGS | 0.0256 | 0.00465 | -0.00605 | 0.00369 | -0.00853 | 0.00387 |

The SE columns show the standard errors.

Average Computation Times

| Method          | Computation time (sec.) | SE      |
|-----------------|-------------------------|---------|
| Laplace         | 9.736                   | 0.54340 |
| GVA             | 1.717                   | 0.02483 |
| GVA (4 threads) | 0.487                   | 0.00737 |
| GVA LBFGS       | 30.166                  | 0.63334 |

The SE column shows the standard errors.

Bias: Random Effects’ Standard Deviations

| Method    | Inter   | SE      | X1      | SE      | X2      | SE      |
|-----------|---------|---------|---------|---------|---------|---------|
| Laplace   | -0.0329 | 0.00573 | -0.4594 | 0.00778 | -0.4558 | 0.00725 |
| GVA       | -0.0282 | 0.00576 | -0.0297 | 0.00563 | -0.0183 | 0.00566 |
| GVA LBFGS | -0.0282 | 0.00576 | -0.0298 | 0.00564 | -0.0183 | 0.00567 |

The SE columns show the standard errors.

Bias: Correlation Coefficients

| Method    | Inter:X1 | SE     | Inter:X2 | SE     | X1:X2    | SE     |
|-----------|----------|--------|----------|--------|----------|--------|
| Laplace   | -0.38539 | 0.0606 | -0.33844 | 0.0573 | 0.16671  | 0.0752 |
| GVA       | -0.00593 | 0.0141 | -0.00630 | 0.0113 | -0.00783 | 0.0141 |
| GVA LBFGS | -0.00587 | 0.0141 | -0.00623 | 0.0113 | -0.00787 | 0.0142 |

The SE columns show the standard errors.

Extensions and Conclusions

Conclusions

We introduced variational approximations and the related optimization problem.

They may yield a low bias.

Quasi-Newton methods for partially separable functions can yield estimates fast.

Variational Approximations

Can use restricted Gaussian variational approximations for fast estimation.

Particularly useful for crossed and nested random effects. E.g. Challis and Barber (2013), Miller, Foti, and Adams (2017), and Ong, Nott, and Smith (2018).

Use other variational distributions.

The psqn Package

Add Newton conjugate gradient method.

Add constraints.

Add a trust region method.

Add other sparsity patterns of the Hessian.

Thank You!

The presentation is at rpubs.com/boennecd/SWE-REG-spring-21.

The markdown is at github.com/boennecd/Talks.

More examples at github.com/boennecd/psqn-va-ex.

The psqn package is on CRAN and at github.com/boennecd/psqn.

References are on the next slide.

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Challis, Edward, and David Barber. 2013. “Gaussian Kullback-Leibler Approximate Inference.” Journal of Machine Learning Research 14 (32): 2239–86.

Coppola, Antonio, Brandon Stewart, and Naoaki Okazaki. 2020. lbfgs: Limited-Memory BFGS Optimization.

Miller, Andrew C., Nicholas J. Foti, and Ryan P. Adams. 2017. “Variational Boosting: Iteratively Refining Posterior Approximations.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:2420–9. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR.

Nocedal, Jorge, and Stephen Wright. 2006. Numerical Optimization. 2nd ed. Springer Science & Business Media. https://doi.org/10.1007/978-0-387-40065-5.

Ong, Victor M.-H., David J. Nott, and Michael S. Smith. 2018. “Gaussian Variational Approximation with a Factor Covariance Structure.” Journal of Computational and Graphical Statistics 27 (3): 465–78. https://doi.org/10.1080/10618600.2017.1390472.

Ormerod, J. T., and M. P. Wand. 2012. “Gaussian Variational Approximate Inference for Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 21 (1): 2–17. http://www.jstor.org/stable/23248820.