Random regression: Covariates, Softwares, and Model Evaluation
This post captures key concepts, comparisons, and design criteria for random regression models (RRMs) used to analyze longitudinal phenotypic and genomic data.
1. Comparison of Covariate Functions for Random Regression
When modeling longitudinal trajectories (such as milk yield or growth curves), the time covariate must be modeled using mathematical functions. The table below compares the most common choices:
1.1 Legendre Polynomials (Orthogonal Polynomials)
Traditionally the most popular choice in animal breeding (Kirkpatrick & Heckman, 1989; Meyer, 1998). Time is scaled to the interval $[-1, 1]$ and orthogonal polynomial terms are evaluated.
-
Pros: Orthogonality minimizes multicollinearity between regression coefficients, resulting in superior numerical stability and faster convergence in solvers.
-
Cons: Highly susceptible to Runge’s phenomenon (severe oscillations and artifacts at the extreme ends of the trajectory, e.g., early and late lactation).
1.2 B-Splines (Basis Splines)
Piecewise polynomials joined smoothly at knots (the model implemented in this package; White et al., 1999).
-
Pros: Local support prevents noise or outliers in one part of the trajectory (e.g., late lactation DIM > 275) from destabilizing breeding value estimations at other stages (e.g., peak yield).
-
Cons: Requires explicit design decisions regarding the number and placement of knots. Common placement strategies include equal-spacing, quantile-based positioning (placing knots at percentiles of the DIM distribution), or biologically motivated landmarks (e.g., at the start, peak, and end of lactation).
1.3 Natural Cubic Splines (NCS)
A variant of B-splines that enforces linear extrapolation beyond the boundary knots (Meyer, 2005).
-
Pros: Retains the local flexibility of splines in the mid-lactation region while preventing extreme boundary oscillations.
-
Cons: Slightly less flexible at the exact boundaries.
1.4 Biological / Empirical Functions (e.g., Wilmink or Ali-Schaeffer)
Fixed-form mathematical equations tailored to represent the typical shape of lactation curves (Ali & Schaeffer, 1984; Wilmink, 1987).
-
Pros: Highly parsimonious (3–5 parameters) with biologically interpretable coefficients (e.g., rate of decline after peak).
-
Cons: Very rigid; cannot adapt well to atypical individuals or irregular trajectories.
Model Comparison Matrix
| Feature | Legendre Polynomials | B-Splines | Natural Cubic Splines | Biological Functions |
|---|---|---|---|---|
| Mathematical Basis | Continuous orthogonal polynomials | Piecewise polynomials | Piecewise polynomials with linear boundaries | Fixed empirical equations (log/exponential) |
| Flexibility | Moderate (increases with order) | Very High (increases with knot count) | High | Low |
| Edge Stability (DIM 6 or 305) | Poor (prone to severe oscillations) | Moderate (susceptible to local noise) | Good (constrained to linear extrapolation) | Good (smooth curves) |
| Numerical Stability | Excellent (orthogonal coefficients) | Moderate (can have collinearity among adjacent knots) | Moderate | Good |
| Parameter Tuning | Only the polynomial order | Number and placement of knots | Number and placement of knots | None (fixed by formula) |
2. Available Mixed Model Software Packages
Several statistical packages are widely used to solve genomic random regression models:
2.1 WOMBAT
A specialized Fortran package developed by Karin Meyer for mixed model analyses in quantitative genetics (Meyer, 2007).
-
Methodology: Frequentist Restricted Maximum Likelihood (REML) using Average Information (AI-REML) and Expectation-Maximization (EM-REML) algorithms.
-
Random Regression: Supports custom basis functions (via user-defined
.baffiles) and heterogeneous residual classes. -
Genomic Support: Natively reads sparse inverse genomic relationship matrices (
.ginfiles).
2.2 JWAS.jl
A pure Julia package designed for single-step genomic prediction and Bayesian association studies (Cheng et al., 2018).
-
Methodology: Bayesian MCMC (Gibbs sampling) implementing various Bayesian alphabets (BayesA, BayesB, BayesC, BayesCpi, BayesR) and Bayesian GBLUP.
-
Random Regression: Achieved by defining interactions between random effects (e.g., animal, subject) and spline covariates.
-
Genomic Support: Natively supports Genomic Relationship Matrices ($G$) passed directly to the random effect prior.
2.3 MixedModels.jl
The standard frequentist package for fitting linear mixed-effects models in Julia (Bates et al., 2024).
-
Methodology: Frequentist REML using derivative-free optimization (LN_BOBYQA).
-
Random Regression: Specified via standard random slope formulas, e.g.,
(0 + b1 + b2 + b3 | animal). -
Genomic Support: Does not natively read relationship matrices. Instead, the GRM is incorporated via its Cholesky factor ($\mathbf{G} = \mathbf{L}\mathbf{L}^T$): the animal design matrix is pre-multiplied by $\mathbf{L}$, which transforms the correlated genomic random effects into i.i.d. standard normals that the package handles natively.
-
Heterogeneous Residuals: Not supported; assumes a single homogeneous residual variance across all records.
2.4 DMU
A comprehensive software package for analyzing multivariate mixed models, developed at Aarhus University (Madsen & Jensen, 2013).
-
Methodology: Frequentist REML using AI-REML, EM-REML, and Bayesian MCMC (Gibbs sampling). The Average Information (AI-REML) algorithm (Jensen et al., 1997) is utilized to achieve rapid convergence for complex multivariate mixed-effects models compared to standard EM-REML.
-
Random Regression: Supports continuous covariates and spline functions for modeling longitudinal traits.
-
Genomic Support: Natively supports pedigree-based relationships and Genomic Relationship Matrices ($G$) for GBLUP and single-step GBLUP.
2.5 MiX99
A highly optimized software suite developed by Luke (Natural Resources Institute Finland) for large-scale genetic evaluations (Strandén & Lidauer, 1999).
-
Methodology: Frequentist solving of Mixed Model Equations using preconditioned conjugate gradient iteration on data (PCG-IOD), avoiding the need to explicitly set up the coefficient matrix in memory. Strandén & Lidauer (1999) demonstrated that this iterative PCG approach solves extremely large mixed models efficiently on standard hardware.
-
Random Regression: Highly optimized for national-scale evaluations (e.g., test-day models) utilizing Legendre polynomials or B-splines. Lidauer et al. (2015) benchmarked its high computational efficiency for large-scale longitudinal analyses.
-
Genomic Support: Full support for genomic predictions, including standard GBLUP and single-step GBLUP (ssGBLUP) for millions of animals.
2.6 Software Feature Summary
| Feature | WOMBAT | JWAS.jl | MixedModels.jl | DMU | MiX99 |
|---|---|---|---|---|---|
| Estimation Method | REML (AI/EM) | Bayesian MCMC | REML (BOBYQA) | REML + MCMC | REML (PCG) |
| Random Regression | Yes (.baf files) |
Yes (interaction syntax) | Yes (formula syntax) | Yes | Yes |
| Native GRM Support | Yes (.gin sparse) |
Yes (direct) | No (Cholesky trick) | Yes | Yes |
| ssGBLUP | Limited | No | No | Yes | Yes |
| Heterogeneous Residuals | Yes | Yes | No | Yes | Yes |
| Intended Scale | Research | Research | Research | National | National |
3. Model Selection Strategy and Rationale
Rather than executing a comparative analysis across multiple software packages (e.g., WOMBAT vs. JWAS.jl vs. MixedModels.jl vs. DMU vs. MiX99), this project focuses exclusively on using WOMBAT to identify optimal random regression models.
3.1 Rationale for Focus on WOMBAT
-
Statistical Equivalence: All mixed-model software packages solve the same underlying Mixed Model Equations (MME). For the same dataset and model configuration, they yield statistically equivalent estimates of breeding values (EBVs) and variance components. Therefore, software comparison does not contribute to biological insights.
-
Model Selection Capability: Research-scale direct factorization software like WOMBAT computes exact log-likelihoods, enabling the use of statistical criteria (AIC, BIC, and Likelihood Ratio Tests) to optimize the B-spline knot layout.
-
Limitations of National-Scale Solvers: While national-scale solvers (such as MiX99 or DMU) scale efficiently to millions of records using Iteration on Data (IOD; Strandén & Lidauer, 1999) and Preconditioned Conjugate Gradient (PCG) methods, they do not natively compute exact likelihood values or standard errors, making systematic model selection (e.g., finding optimal B-spline knot layouts) difficult. Computational benchmarking shows that their iterative solvers prioritize execution speed on massive evaluations (Lidauer et al., 2015) rather than likelihood profile calculations.
-
Genomic Support: WOMBAT natively and efficiently reads sparse inverse Genomic Relationship Matrices ($\mathbf{G}_{\text{inv}}$), making it highly suitable for our genomic random regression analysis.
3.2 Benchmark Models Evaluated
Two benchmark B-spline configurations are evaluated in WOMBAT to optimize the lactation curve model:
-
The Baseline Model (3k-d): A simple 3-knot density model (using knots
[6, 145, 305]for 305-day or[6, 145, 275]for 275-day analyses). With only 16–17 parameters, this model converges rapidly (under 10 seconds) and exhibits minimal multicollinearity. It is utilized as a baseline to verify data pipeline correctness. -
The Complex Model (5k-v / 5k-d): A 5-knot peaks-based (
[6, 69, 123, 239, 275]) or density-based ([6, 76, 145, 216, 275]) model. With 34–35 parameters, this model represents the optimal balance between biological flexibility and statistical parsimony.
4. Model Evaluation Criteria
To determine if a random regression model is statistically optimal and biologically realistic, apply the following evaluation criteria:
4.1 Goodness-of-Fit and Information Criteria
-
Akaike & Bayesian Information Criteria (AIC & BIC): Evaluate model fit while penalizing parameter complexity: $$\text{AIC} = -2\log L + 2p$$ $$\text{BIC} = -2\log L + p\log(N - r)$$ where $p$ is the number of covariance parameters, $N$ is the number of records, and $r$ is the rank of the fixed effects design matrix.
- Judgement: Lower values are better. Use BIC as the primary metric for large longitudinal datasets, as it penalizes overparameterization more aggressively. Note that WOMBAT, DMU, and MixedModels.jl may use different likelihood constants, so AIC/BIC values should only be compared within the same software.
-
Likelihood Ratio Test (LRT): For nested model comparisons (e.g., comparing $K$ knots to $K+1$ knots): $$\text{LRT} = 2 \times (\log L_{\text{complex}} - \log L_{\text{simple}})$$ Compare against a $\chi^2$ distribution with degrees of freedom equal to the difference in covariance parameters.
4.2 Eigenanalysis of Genetic Covariance Matrices
Analyze the eigenvalues and eigenvectors of the genetic covariance matrix of the coefficients ($\mathbf{K}_g$):
-
Eigenvalue Variance Explained: Perform eigen-decomposition ($\mathbf{K}_g = \mathbf{V}\mathbf{D}\mathbf{V}^T$). The first 2 or 3 eigenvalues should explain $> 95%$ of the total genetic variance. If the 4th or 5th eigenvalues are extremely small, it indicates the spline model is overparameterized and fitting noise.
-
Eigenvector Trajectory Shapes: Plot the eigenvectors over time (DIM):
-
PC1 (Largest Eigenvalue): Represents the overall height of the lactation curve (should be flat and positive).
-
PC2: Represents the slope of the curve (lactation persistency).
-
PC3: Represents the curvature (timing of the peak).
-
Overfitting Warning: If the curves of the first three PCs fluctuate rapidly or erratically, the model is overfitting.
-
4.3 Biological Trajectory Validation
Plot the variance components and herizability ($h^2$) across the entire trajectory:
-
Heritability Trajectory: For milk yield, heritability should be stable and follow a realistic curve, typically peaking in mid-lactation ($0.15$–$0.25$) and dropping slightly at the edges. Sharp spikes (e.g., $h^2 \to 0.9$) or drops to zero indicate poor local knot placement.
-
Genetic Correlation Heatmaps: Compute the genetic correlation between days $t_1$ and $t_2$: $$r_g(t_1, t_2) = \frac{\text{Cov}_g(t_1, t_2)}{\sqrt{\text{Var}_g(t_1) \text{Var}_g(t_2)}}$$
-
Judgement: Correlation between adjacent days should be near $1.0$ and decrease smoothly and monotonically as the distance $|t_1 - t_2|$ increases.
-
Boundary Artifacts: Genetic correlations across lactation should remain positive ($\ge 0.3$). Negative correlations between early and late lactation indicate boundary instability (Runge’s phenomenon or lack of boundary constraints).
-
References
-
Ali, T. E., & Schaeffer, L. R. (1984). Accounting for lactation curves in dairy cattle. Canadian Journal of Animal Science, 64(2), 363-372.
-
Bates, D., Alday, P. M., & Vasishth, S. (2024). MixedModels.jl: Mixed-effects models in Julia. Zenodo. https://doi.org/10.5281/zenodo.17091233
-
Cheng, H., Garrick, D. J., & Fernando, R. L. (2018). JWAS: Julia for Whole-genome Analysis Software. Journal of Open Source Software, 3(30), 918.
-
Jensen, J., Mäntysaari, E. A., Madsen, P., & Thompson, R. (1997). Residual maximum likelihood estimation of (co)variance components in multivariate mixed linear models using average information. Journal of the Indian Society of Agricultural Statistics, 49, 215–236.
-
Kirkpatrick, M., & Heckman, N. (1989). A quantitative genetic model for growth, shape, reaction norms, and other infinite-dimensional characters. Journal of Mathematical Biology, 27(4), 429-450.
-
Lidauer, M. H., Strandén, I., Mäntysaari, E. A., Pösö, J., & Aamand, G. P. (2015). Computational efficiency of test-day model evaluations. Journal of Dairy Science, 98(7), 4983-4993.
-
Madsen, P., & Jensen, J. (2013). A User’s Guide to DMU. A Package for Analysing Multivariate Mixed Models. Version 6, release 5.2. Center for Quantitative Genetics and Genomics, Aarhus University.
-
Meyer, K. (1998). Estimating covariance functions for longitudinal data using a random regression model. Genetics Selection Evolution, 30(3), 221-240.
-
Meyer, K. (2005). Random regression analyses using B-splines to model growth of Australian Angus cattle. Genetics Selection Evolution, 37(5), 473-500.
-
Meyer, K. (2007). WOMBAT—A program for mixed model analyses by restricted maximum likelihood. Journal of Zhejiang University Science B, 8(11), 815-821.
-
Strandén, I., & Lidauer, M. (1999). Solving large mixed model equations using preconditioned conjugate gradient iteration on data. Journal of Dairy Science, 82(12), 2779-2787.
-
VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414-4423.
-
White, I. M. S., Thompson, R., & Brotherstone, S. (1999). Genetic and environmental smoothing of lactation curves with B-splines. Journal of Dairy Science, 82(9), 1880-1887.
-
Wilmink, J. B. M. (1987). Adjustment of test-day milk, fat and protein yields of dairy cows for age, season and stage of lactation. Livestock Production Science, 16(4), 321-334.