\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Twins, individuals treated at the same hospital, repeated measurements, etc.
Interested in modeling the dependence per se.
Focus on clustered data.
E.g. using the Swedish multi-generation register.
Random intercept and slope models for neighborhood effects, school effects, etc.
E.g. accounting for diabetic complications and medication, using data from the Swedish SCREAM project.
Outcomes \(\vec Y_i = (Y_{i1}, \dots, Y_{in_i})^\top\), model parameters \(\vec\theta\), and random effects \(\vec U_i\) for cluster \(i\).
\(\vec Y_i\) given \(\vec U_i = \vec u\) has density \(h_i(\vec y;\vec u, \vec\theta)\) and \(\vec U_i\) has density \(g_i(\vec u; \vec\theta)\).
The log marginal likelihood term is
\[ l_i(\vec\theta) = \log\int h_i(\vec Y_i;\vec u, \vec\theta)g_i(\vec u; \vec\theta)\der\vec u. \]
No closed-form solution.
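Since the \(m\) clusters are independent, the full log marginal likelihood to be maximized is the sum of these terms:
\[ l(\vec\theta) = \sum_{i = 1}^m l_i(\vec\theta). \]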
Issue: current approximations have long estimation times or large bias.
Use variational approximations to approximate the log marginal likelihood terms, \(l_i\)’s.
We made a package for this.
Select some variational distribution with density \(q_i(\vec u;\vec\omega_i)\) for \(\vec\omega_i \in \Omega_i\) and use the lower bound
\[ \tilde l_i(\vec\theta, \vec\omega_i) = \int q_i(\vec u;\vec\omega_i)\log\frac{h_i(\vec Y_i;\vec u, \vec\theta)g_i(\vec u; \vec\theta)}{q_i(\vec u;\vec\omega_i)}\der\vec u \leq l_i(\vec\theta). \]
Point: replace \(l_i\) with the more tractable lower bound \(\tilde l_i\).
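For completeness, the bound follows from Jensen's inequality since the logarithm is concave:
\[ l_i(\vec\theta) = \log \Expec_{q_i}\left[\frac{h_i(\vec Y_i;\vec U, \vec\theta)g_i(\vec U; \vec\theta)}{q_i(\vec U;\vec\omega_i)}\right] \geq \Expec_{q_i}\left[\log \frac{h_i(\vec Y_i;\vec U, \vec\theta)g_i(\vec U; \vec\theta)}{q_i(\vec U;\vec\omega_i)}\right] = \tilde l_i(\vec\theta, \vec\omega_i), \]
where the expectation is over \(\vec U \sim q_i(\cdot;\vec\omega_i)\).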
Parameters: \(\text{dim}(\vec\theta) + \sum_{i = 1}^m \text{dim}(\vec\omega_i) \gg \text{dim}(\vec\theta)\)!
… but the Hessian is sparse (partially separable).
See Nocedal and Wright (2006) and the package mentioned later for details.
Have to optimize
\[ \tilde l(\vec \theta, \vec\omega_1,\dots,\vec\omega_m) = \sum_{i = 1}^m \tilde l_i(\vec\theta, \vec\omega_i). \]
E.g. BFGS.
… does not preserve the sparsity.
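Concretely, each term \(\tilde l_i\) depends only on \((\vec\theta, \vec\omega_i)\), so the exact Hessian has a block-arrow pattern which a dense BFGS approximation fills in (the notation \(\mat H_i^{\theta\theta}\), \(\mat H_i^{\theta\omega}\), \(\mat H_i^{\omega\omega}\) for the blocks of the Hessian of \(\tilde l_i\) is ours):
\[ \nabla^2 \tilde l = \begin{pmatrix} \sum_{i = 1}^m \mat H_i^{\theta\theta} & \mat H_1^{\theta\omega} & \cdots & \mat H_m^{\theta\omega} \\ (\mat H_1^{\theta\omega})^\top & \mat H_1^{\omega\omega} & & \\ \vdots & & \ddots & \\ (\mat H_m^{\theta\omega})^\top & & & \mat H_m^{\omega\omega} \end{pmatrix}. \]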
Instead, make \(m\) separate BFGS approximations, one for the Hessian of each of \(\tilde l_1(\vec\theta, \vec\omega_1), \dots, \tilde l_m(\vec\theta, \vec\omega_m)\).
This is fast, as it preserves the sparsity.
It may also yield a better Hessian approximation.
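To make the idea concrete, here is a schematic R sketch (not the psqn source; the dimensions, the `bfgs_update` helper, and the update-skipping rule are our own choices) of keeping one small BFGS approximation per element function and forming Hessian-vector products block by block, which is all a conjugate gradient method needs:

```r
p <- 2L # number of global parameters, dim(theta)
s <- 3L # number of private parameters per element function, dim(omega_i)
m <- 4L # number of element functions

# standard BFGS update of a Hessian approximation B given a parameter step
# s_vec and a gradient difference y_vec; the update is skipped when the
# curvature condition fails, to keep B positive definite
bfgs_update <- function(B, s_vec, y_vec){
  sy <- sum(s_vec * y_vec)
  if(sy <= sqrt(.Machine$double.eps) * sqrt(sum(s_vec^2) * sum(y_vec^2)))
    return(B)
  Bs <- drop(B %*% s_vec)
  B - tcrossprod(Bs) / sum(s_vec * Bs) + tcrossprod(y_vec) / sy
}

# one small (p + s) x (p + s) approximation per element function
B_list <- replicate(m, diag(p + s), simplify = FALSE)

# indices of (theta, omega_i) in the concatenated parameter vector
idx <- function(i) c(seq_len(p), p + (i - 1L) * s + seq_len(s))

# product of the implied sparse Hessian approximation with a vector v,
# computed without ever forming the (p + m * s) x (p + m * s) matrix
hess_vec_prod <- function(B_list, v){
  out <- numeric(p + m * s)
  for(i in seq_along(B_list)){
    ii <- idx(i)
    out[ii] <- out[ii] + drop(B_list[[i]] %*% v[ii])
  }
  out
}

# e.g. the product with a vector of ones
hess_vec_prod(B_list, rep(1, p + m * s))
```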
Suppose \(\text{dim}(\vec\theta) = p\), \(\text{dim}(\vec\omega_i) = s\), and at most \(n_{cg}\) conjugate gradient iterations.
Ratio of flops between BFGS and the new methods is (roughly)
\[ \frac{p^2 + 2mps + (ms)^2} {n_{cg}(p^2 + 2pms + ms^2) + mp^2} \approx \frac m{n_{cg}}. \]
for large \(m\) and \(s\) slightly larger than \(p\). Favorable if \(n_{cg} \ll m\).
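Plugging hypothetical values into the display above (chosen to be roughly in line with the simulation study below):

```r
# rough flop ratio between BFGS and the new method
flop_ratio <- function(p, s, m, n_cg)
  (p^2 + 2 * m * p * s + (m * s)^2) /
    (n_cg * (p^2 + 2 * p * m * s + m * s^2) + m * p^2)

# e.g. p = 9, s = 9, m = 1000 clusters, and n_cg = 25 CG iterations
flop_ratio(p = 9, s = 9, m = 1000, n_cg = 25) # ~ 13: far fewer flops
```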
Implemented in the psqn package.
Both an R and a C++ interface.
The C++ implementation supports parallel computation with OpenMP.
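As an illustration, a minimal toy sketch of the R interface, written from memory of the package's vignette; treat the conventions assumed here (the element function receives `(i, par, comp_grad)`, returns the two parameter dimensions when `par` is `NULL`, and attaches the gradient as a `"grad"` attribute) as assumptions and check them against `vignette("psqn", package = "psqn")`:

```r
library(psqn)
set.seed(1)
p <- 3L; s <- 2L; m <- 10L
a <- matrix(rnorm(p * m), p) # artificial data for the global part
b <- matrix(rnorm(s * m), s) # artificial data for the private parts

# element function i: a toy partially separable quadratic in the global
# parameters theta and the i-th private parameters omega_i
ele_func <- function(i, par, comp_grad){
  if(is.null(par))
    return(c(global = p, private = s)) # the parameter dimensions
  theta <- par[1:p]
  omega <- par[-(1:p)]
  r1 <- theta - a[, i]
  r2 <- omega - b[, i]
  out <- sum(r1^2) + sum(r2^2)
  if(comp_grad)
    attr(out, "grad") <- c(2 * r1, 2 * r2)
  out
}

# minimize the sum of the element functions
res <- psqn(par = numeric(p + m * s), fn = ele_func, n_ele_func = m)
head(res$par, p) # should be close to rowMeans(a)
```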
E.g. the inner optimization in the Laplace method.
Take a GLMM with a logit link function.
Gaussian variational approximations (GVA; Ormerod and Wand 2012) estimated with the psqn package.
The Laplace approximation from lme4 (Bates et al. 2015).
GVA estimated with a limited-memory BFGS method from the lbfgs package (Coppola, Stewart, and Okazaki 2020).
E.g. schools or neighborhoods.
Three fixed effects, three random effects per cluster, \(m = 1000\) clusters, and \(n_i = 10\) observations in each cluster.
\[\begin{align*} Y_{ij} &\sim \text{Bin}\left(1, (1 + \exp(-\eta_{ij}))^{-1}\right) \\ \eta_{ij} &= (\vec\beta + \vec U_i)^\top\vec x_{ij} \\ \vec x_{ij} &= (1, Z_{ij1}, Z_{ij2})^\top \\ Z_{ij1}, Z_{ij2} &\sim N(0, 1) \\ \vec\beta &= (-2, 2^{-1/2}, 2^{-1/2})^\top \\ \vec U_i &\sim N^{(3)}(\vec 0_3, 3^{-1}\mat I_3) \end{align*}\]
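To make the setup concrete, a sketch in R that simulates one such data set and evaluates the per-cluster Gaussian variational lower bound in the spirit of Ormerod and Wand (2012); the seed, the helper names, and the use of `integrate` for the one-dimensional expectations are our own choices, not from the talk:

```r
set.seed(1)
m <- 1000L; n_i <- 10L
beta <- c(-2, 1 / sqrt(2), 1 / sqrt(2))
Sigma <- diag(3) / 3 # covariance matrix of the random effects

# simulate the clusters
dat <- lapply(seq_len(m), function(i){
  U <- drop(crossprod(chol(Sigma), rnorm(3)))
  X <- cbind(1, rnorm(n_i), rnorm(n_i))
  eta <- drop(X %*% (beta + U))
  list(y = rbinom(n_i, 1L, 1 / (1 + exp(-eta))), X = X)
})

# Gaussian variational lower bound for one cluster with variational
# distribution N(mu, Lambda) for the random effects
gva_lb <- function(y, X, beta, Sigma, mu, Lambda){
  k <- length(mu)
  mean_eta <- drop(X %*% (beta + mu))    # means of the linear predictors
  var_eta <- rowSums((X %*% Lambda) * X) # variances of the linear predictors
  softplus <- function(z) ifelse(z > 30, z, log1p(exp(z)))
  # E log(1 + exp(eta_ij)) under the variational distribution
  B_vals <- mapply(function(mn, vr)
    integrate(function(z) softplus(z) * dnorm(z, mn, sqrt(vr)),
              -Inf, Inf)$value, mean_eta, var_eta)
  Sigma_inv <- solve(Sigma)
  log_det_ratio <- as.numeric(determinant(Lambda)$modulus) -
    as.numeric(determinant(Sigma)$modulus)
  sum(y * mean_eta - B_vals) + log_det_ratio / 2 -
    (sum(Sigma_inv * Lambda) + sum(mu * (Sigma_inv %*% mu))) / 2 + k / 2
}

# e.g. the lower bound for the first cluster at the true parameters
gva_lb(dat[[1L]]$y, dat[[1L]]$X, beta, Sigma,
       mu = numeric(3), Lambda = diag(3) / 3)
```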
Variational approximations are faster.
They yield lower bias in this case.
The new optimization method is much faster than generic alternatives.
|           | Inter  | SE      | X1       | SE      | X2       | SE      |
|-----------|--------|---------|----------|---------|----------|---------|
| Laplace   | 0.1796 | 0.00423 | -0.05136 | 0.00383 | -0.05472 | 0.00363 |
| GVA       | 0.0255 | 0.00465 | -0.00603 | 0.00369 | -0.00851 | 0.00387 |
| GVA LBFGS | 0.0256 | 0.00465 | -0.00605 | 0.00369 | -0.00853 | 0.00387 |
The SE columns show the standard errors.
|                 | Computation time (sec.) | SE      |
|-----------------|-------------------------|---------|
| Laplace         | 9.736                   | 0.54340 |
| GVA             | 1.717                   | 0.02483 |
| GVA (4 threads) | 0.487                   | 0.00737 |
| GVA LBFGS       | 30.166                  | 0.63334 |
The SE columns show the standard errors.
|           | Inter   | SE      | X1      | SE      | X2      | SE      |
|-----------|---------|---------|---------|---------|---------|---------|
| Laplace   | -0.0329 | 0.00573 | -0.4594 | 0.00778 | -0.4558 | 0.00725 |
| GVA       | -0.0282 | 0.00576 | -0.0297 | 0.00563 | -0.0183 | 0.00566 |
| GVA LBFGS | -0.0282 | 0.00576 | -0.0298 | 0.00564 | -0.0183 | 0.00567 |
The SE columns show the standard errors.
|           | Inter:X1 | SE     | Inter:X2 | SE     | X1:X2    | SE     |
|-----------|----------|--------|----------|--------|----------|--------|
| Laplace   | -0.38539 | 0.0606 | -0.33844 | 0.0573 | 0.16671  | 0.0752 |
| GVA       | -0.00593 | 0.0141 | -0.00630 | 0.0113 | -0.00783 | 0.0141 |
| GVA LBFGS | -0.00587 | 0.0141 | -0.00623 | 0.0113 | -0.00787 | 0.0142 |
The SE columns show the standard errors.
We introduced variational approximations and the related optimization problem.
They may yield low bias.
Quasi-Newton methods for partially separable functions can yield estimates fast.
Particularly useful for crossed and nested random effects.
Use other variational distributions, e.g. Challis and Barber (2013), Miller, Foti, and Adams (2017), and Ong, Nott, and Smith (2018).
Add a Newton conjugate gradient method.
Add constraints.
Add a trust region method.
Support other sparsity patterns of the Hessian.
The presentation is at rpubs.com/boennecd/SWE-REG-spring-21.
The markdown is at github.com/boennecd/Talks.
More examples at github.com/boennecd/psqn-va-ex.
The psqn package is on CRAN and at github.com/boennecd/psqn.
References are on the next slide.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Challis, Edward, and David Barber. 2013. “Gaussian Kullback-Leibler Approximate Inference.” Journal of Machine Learning Research 14 (32): 2239–86.
Coppola, Antonio, Brandon Stewart, and Naoaki Okazaki. 2020. lbfgs: Limited-Memory BFGS Optimization.
Miller, Andrew C., Nicholas J. Foti, and Ryan P. Adams. 2017. “Variational Boosting: Iteratively Refining Posterior Approximations.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:2420–9. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR.
Nocedal, Jorge, and Stephen Wright. 2006. Numerical Optimization. 2nd ed. Springer Science & Business Media. https://doi.org/10.1007/978-0-387-40065-5.
Ong, Victor M.-H., David J. Nott, and Michael S. Smith. 2018. “Gaussian Variational Approximation with a Factor Covariance Structure.” Journal of Computational and Graphical Statistics 27 (3): 465–78. https://doi.org/10.1080/10618600.2017.1390472.
Ormerod, J. T., and M. P. Wand. 2012. “Gaussian Variational Approximate Inference for Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 21 (1): 2–17. http://www.jstor.org/stable/23248820.