Explores training dynamics of deep neural networks through a statistical lens, examining the duality between optimization and inference perspectives.
Training Deep Networks as Random Effects: An Optimization–Inference Duality.
In traditional deep learning we think of training a network as a pure optimization process: adjust weights via gradient descent until the loss is small. Early stopping or regularization is applied more as a practical hack than as a principled inference choice. A new paper by Yao et al. (2026) offers a fresh perspective. It shows that, in the infinite-width “Neural Tangent Kernel” (NTK) regime, training a deep network can be viewed exactly as performing inference in a statistical mixed-effects model. In this view, the training time itself acts like a variance hyperparameter in a Bayesian model, and early stopping becomes an empirical-Bayes decision rule. Concretely, the authors prove that the output of gradient-flow training for a wide network is identical to the Best Linear Unbiased Predictor (BLUP) of a classical random-effects model.
Put differently, imagine a probabilistic model where the network’s latent signal is a Gaussian random effect with covariance tied to the NTK at initialization. As training proceeds, more of that latent variance is revealed. The paper shows that the network’s predictions at any time $t$ coincide with the posterior mean of that latent signal under a model where the random-effect variance grows with $t$. In this framework, training time is a variance component (the usual way statisticians allocate variation between noise and signal), so choosing when to stop training is just like choosing a prior variance in an empirical-Bayes fashion.
This optimization–inference duality means the gradient-descent trajectory is doing double duty: it’s minimizing the loss and it’s tracing out a path of empirical-Bayes inference. Condition on a chosen time $t$: the trained net’s output is exactly the posterior mean of the latent random effects given the data. And instead of tuning $t$ by a held-out validation loss, the authors show you can estimate $t$ by maximizing the likelihood (more precisely, the restricted likelihood) of the observed data under the random-effects model. In simple terms, early stopping is recast as a fully principled statistical inference step rather than an external heuristic.
The paper distills this insight into a two-stage statistical procedure for training:
Test if training is needed. First, use a variance-component score test to check whether there is any real structure in the data that training can capture. Equivalently, test the null hypothesis “training time $t=0$” (i.e., the network remaining at random initialization) versus “$t>0$” (the network moves away from initialization). If the test fails to reject, there is no evidence of structure beyond what the initialization already explains, and further training would likely just fit noise.
REML-based early stopping. If the test does indicate structure beyond the initialization, then we proceed to train. But instead of using a validation set, we compute the restricted maximum likelihood (REML) estimate of $t$. This gives a single “optimal” stopping time $\hat t_{\rm REML}$. The resulting rule has a clean interpretation: in the eigenbasis of the NTK, one trains until the residual loss left on each eigen-direction is uncorrelated with that direction’s eigenvalue. In other words, one trains until the network has adequately extracted signal from the dominant modes without overfitting the small-eigenvalue, noise-dominated modes. Remarkably, they prove this REML stopping time is asymptotically optimal. With a fixed design, its in-sample error asymptotically matches the best achievable over all stopping times, and under reasonable random-design assumptions the same holds for out-of-sample error. In short, this view turns both questions of training-time selection (“Should I train? How long?”) into a likelihood-based inference problem.
From NTK Gradient Flow to Random Effects.
To appreciate the result, recall the NTK picture of wide neural nets. In the infinite-width limit (or very large networks), we can linearize the network around its initialization. Gradient descent (or, in continuous time, gradient flow) then effectively solves a linear least-squares problem in a very high-dimensional feature space. Concretely, if $\Theta_\infty$ is the limiting NTK matrix on the training inputs $X$, then the network’s predictions evolve according to a simple differential equation involving $\Theta_\infty$. The classic result is that, after infinite time, you end up with the minimum-norm (or “ridgeless”) regression solution in that feature space. But at finite training time, you have a partial fit: the training-set predictions are $f_0(X) + (I - e^{-t\Theta_\infty})(y - f_0(X))$, where $f_0(X)$ is the network output at initialization and $y$ are the targets.
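The finite-time formula above can be evaluated directly from an eigendecomposition of the kernel matrix. The following is a minimal sketch, not the authors' code: the function name `gradient_flow_predictions` is hypothetical, and an RBF kernel stands in for the NTK Gram matrix purely as an illustrative assumption.

```python
import numpy as np

def gradient_flow_predictions(theta, y, f0, t):
    """Training-set predictions f0 + (I - exp(-t * theta)) (y - f0).

    theta is assumed to be a symmetric positive semi-definite kernel matrix.
    """
    eigvals, eigvecs = np.linalg.eigh(theta)   # theta = V diag(eigvals) V^T
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative round-off
    coeffs = eigvecs.T @ (y - f0)              # residual expressed in the eigenbasis
    shrink = -np.expm1(-t * eigvals)           # 1 - exp(-t * lambda_i), computed stably
    return f0 + eigvecs @ (shrink * coeffs)

# Toy data. The RBF kernel is a stand-in for the NTK (an assumption for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
theta = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)
f0 = np.zeros_like(y)                          # pretend the initial network outputs zero

for t in (0.0, 1.0, 100.0):
    resid = np.linalg.norm(y - gradient_flow_predictions(theta, y, f0, t))
    print(f"t={t:6.1f}  training residual norm={resid:.4f}")
```

At $t=0$ the predictions equal the initialization, and the training residual shrinks monotonically as $t$ grows, with the large-eigenvalue directions fitted first.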
What Yao et al. show is that this partial fit is exactly the BLUP of a random-effects formulation. Imagine a statistical model: $$y = f_0(X) + u_t + \varepsilon_t,$$ where $u_t$ is a Gaussian random vector of latent effects, and $\varepsilon_t$ is noise. They choose $$u_t \sim \mathcal{N}(0,\ \sigma_u^2 (\exp(t\Theta_\infty)-I)),\qquad \varepsilon_t \sim \mathcal{N}(0,\ \sigma_\varepsilon^2 I),$$ with $\sigma_\varepsilon^2$ tied to $\sigma_u^2$ (in the model as written here, the closed-form posterior mean requires the two scales to match). In this model, as $t$ grows, the covariance $\sigma_u^2(\exp(t\Theta_\infty)-I)$ injects more variance (“structure”) into the latent $u_t$.
Gaussian mixed-model theory then tells us the BLUP (posterior mean) of $u_t$ given $y$ is $$\mathbb{E}[u_t \mid y] = (I - e^{-t\Theta_\infty})(y - f_0(X)).$$ But that exactly matches the network’s training prediction offset from initialization. Thus for every $t$, the network’s predictions are the same as performing Gaussian inference for $u_t$. In other words, training to time $t$ is doing Bayesian inference in this latent-variable model. This is why the authors call it an “optimization–inference duality.” Training time $t$ controls how much of the data variance is attributed to the latent signal versus the noise, just like the variance component in a mixed model.
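For readers who want to check the identity, here is a short derivation under one explicit assumption of mine (the paper's exact parametrization may differ): the two variance scales coincide, $\sigma_\varepsilon^2 = \sigma_u^2 = \sigma^2$, and $u_t$ is independent of $\varepsilon_t$. The marginal covariance of $y$ then collapses to a single matrix exponential: $$\operatorname{Cov}(y) = \sigma^2\bigl(e^{t\Theta_\infty} - I\bigr) + \sigma^2 I = \sigma^2 e^{t\Theta_\infty}.$$ Applying the standard Gaussian conditioning formula, and using that $\Theta_\infty$ commutes with its own exponential, $$\mathbb{E}[u_t \mid y] = \operatorname{Cov}(u_t, y)\,\operatorname{Cov}(y)^{-1}\,(y - f_0(X)) = \bigl(e^{t\Theta_\infty} - I\bigr)\, e^{-t\Theta_\infty}\,(y - f_0(X)) = \bigl(I - e^{-t\Theta_\infty}\bigr)(y - f_0(X)).$$ The common scale $\sigma^2$ cancels, which is why the predictor depends on $t$ alone.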
Two points stand out. First, the network’s output at time $t$ is not an exotic solution: it is simply the posterior mean of the latent signal in a model with NTK covariance. Second, at $t=0$ the covariance of $u_t$ vanishes, so $u_0=0$ and the model reduces to $y=f_0(X)+\varepsilon_0$: the network is “predicting” only by its random initialization. As $t$ increases, we gradually allow more flexibility via $u_t$, smoothly moving from the prior ($t=0$) toward the data.
Early Stopping as Empirical Bayes.
This framework has a powerful consequence: it makes it natural to determine how long to train via likelihood rather than trial-and-error. In classical statistics, estimating a variance component like this one is done by Restricted Maximum Likelihood (REML). The authors derive the marginal likelihood of the data under the random-effects model (as a function of $t$) and show that maximizing it yields a unique $\hat t_{\rm REML}$. Amazingly, this criterion can be computed from just the spectrum of the NTK and the training labels.
In fact, they express the REML objective in the eigenbasis of $\Theta_\infty$. Denote the eigenvalues by $\{\lambda_i\}$ and let $c_i$ be the projection of $(y - f_0(X))$ onto the $i$-th eigenvector. Then maximizing the likelihood over $t$ is equivalent to minimizing $$Q(t) = n \log\!\Bigl(\sum_{i=1}^n c_i^2 e^{-t\lambda_i}\Bigr) + t\sum_{i=1}^n \lambda_i$$ over $t \ge 0$. The unique solution $\hat t$ satisfies a simple equation: it is the time when the empirical covariance between the “spectral losses” $c_i^2 e^{-t\lambda_i}$ and the eigenvalues $\lambda_i$ is zero (here I paraphrase the result). Intuitively, gradient descent learns the highest-eigenvalue (most dominant) modes of the data first; as $t$ grows, it starts fitting the smaller-eigenvalue (noisier) modes. The REML solution finds the sweet spot where this fit is balanced: zero correlation means we have neither under-fitted the large modes nor overfitted the noisy ones. After this point, further training would mostly fit noise in the small eigen-directions.
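As a rough illustration of how such a stopping time could be computed, the sketch below minimizes $Q(t)$ by bisection on its derivative. Setting $dQ/dt = 0$ gives exactly the zero-covariance condition between $c_i^2 e^{-t\lambda_i}$ and $\lambda_i$, and $Q$ is convex in $t$, so bisection suffices. The function names and the toy spectrum are hypothetical choices of mine, not taken from the paper.

```python
import numpy as np

def q_objective(t, eigvals, c):
    """Q(t) = n * log(sum_i c_i^2 exp(-t lambda_i)) + t * sum_i lambda_i."""
    log_terms = np.log(c ** 2 + 1e-300) - t * eigvals
    m = log_terms.max()                          # log-sum-exp shift for stability
    return eigvals.size * (m + np.log(np.exp(log_terms - m).sum())) + t * eigvals.sum()

def q_derivative(t, eigvals, c):
    """dQ/dt = sum_i lambda_i - n * weighted mean of lambda_i,
    with weights c_i^2 exp(-t lambda_i) (the "spectral losses")."""
    log_w = np.log(c ** 2 + 1e-300) - t * eigvals
    w = np.exp(log_w - log_w.max())              # rescaled weights, avoids underflow
    return eigvals.sum() - eigvals.size * (w @ eigvals) / w.sum()

def reml_stopping_time(eigvals, c, t_hi=1.0, n_bisect=200):
    """Minimize Q over t >= 0 by bisection on dQ/dt."""
    if q_derivative(0.0, eigvals, c) >= 0.0:
        return 0.0                               # Q already increasing at t = 0
    for _ in range(60):                          # grow the bracket until dQ/dt > 0
        if q_derivative(t_hi, eigvals, c) > 0.0:
            break
        t_hi *= 2.0
    else:
        raise RuntimeError("no sign change found; eigenvalues may be nearly constant")
    t_lo = 0.0
    for _ in range(n_bisect):
        mid = 0.5 * (t_lo + t_hi)
        if q_derivative(mid, eigvals, c) > 0.0:
            t_hi = mid
        else:
            t_lo = mid
    return 0.5 * (t_lo + t_hi)

# Toy spectrum (assumed): coefficients set exactly at the model-implied scale
# c_i^2 = exp(t_true * lambda_i), so the minimizer should recover t_true.
n = 100
eigvals = 10.0 / np.arange(1, n + 1) ** 2
t_true = 0.5
c = np.exp(0.5 * t_true * eigvals)
t_hat = reml_stopping_time(eigvals, c)
w = c ** 2 * np.exp(-t_hat * eigvals)
cov = np.mean(w * eigvals) - w.mean() * eigvals.mean()
print(f"t_hat={t_hat:.6f}  covariance(spectral loss, eigenvalue)={cov:.2e}")
```

With real data the coefficients $c_i$ are noisy, so $\hat t$ will scatter around the underlying value; the noise-free toy case is used here only so the answer can be checked by hand.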
Thus, REML recommends stopping precisely when the spectral loss decorrelation is achieved. The authors further quantify the model’s complexity at time $t$ by an effective degrees-of-freedom: $${\rm edf}(t) = \sum_{i=1}^n \bigl(1 - e^{-t\lambda_i}\bigr),$$ which grows from 0 up to $n$ as $t\to\infty$. In practice, one can compute $\hat t_{\rm REML}$ numerically by a root-finding or line search on the above quantities, without needing a hold-out set. The chosen $t$ then defines a principled early-stopping point for the full training.
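The effective degrees of freedom is just the trace of the smoother matrix $I - e^{-t\Theta_\infty}$, so it is cheap to evaluate once the eigenvalues are known. A small sketch, with an assumed toy spectrum and a hypothetical function name:

```python
import numpy as np

def effective_dof(t, eigvals):
    """edf(t) = sum_i (1 - exp(-t * lambda_i)), i.e. trace(I - exp(-t * Theta))."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.sum(-np.expm1(-t * eigvals)))

# Assumed toy spectrum with fast decay, loosely mimicking a kernel matrix.
eigvals = 10.0 / np.arange(1, 51) ** 2

for t in (0.0, 0.1, 1.0, 10.0, 1e6):
    print(f"t={t:>9.1f}  edf={effective_dof(t, eigvals):7.3f}  (n={eigvals.size})")
```

Note that the limit $n$ is only reached if every eigenvalue is strictly positive; directions with $\lambda_i = 0$ never contribute.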
Two-Stage Training Decision and Tests.
The power of this view is that it doesn’t just pick $t$ blindly; it first asks whether we should train at all. In classical mixed models a variance-component test (score test) is used to test $H_0: \sigma_u^2=0$. Analogously, the authors set up a score test for $t=0$ vs $t>0$: effectively testing whether the NTK-based latent signal significantly improves fit over the initialization. They remove the initial prediction $f_0(X)$ from the data (project onto the orthogonal complement) and compute a test statistic akin to $(n-1)\tilde y^T \tilde H_\infty \tilde y / \tilde y^T \tilde y$ (see the paper for exact form). If this score test is not significant, there is no evidence that any partially trained net improves on the initialization, and one might as well stop immediately (i.e. $t=0$). In practice, this could save needless computation when the data has no learnable pattern under the NTK model.
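To give a feel for what such a test measures, here is a simplified stand-in, explicitly not the paper's statistic or calibration. It uses the ratio $r^T \Theta r / r^T r$ with $r = y - f_0(X)$, which is large when the residual concentrates on high-eigenvalue directions, and calibrates it by Monte Carlo under an assumed Gaussian white-noise null. All names, the RBF kernel, and the toy data are assumptions for illustration.

```python
import numpy as np

def quadratic_ratio(theta, r):
    """T = r^T theta r / r^T r for one residual vector or a batch of rows."""
    r = np.atleast_2d(r)
    return np.sum((r @ theta) * r, axis=1) / np.sum(r * r, axis=1)

def monte_carlo_pvalue(theta, r, n_sim=2000, seed=0):
    """One-sided Monte Carlo p-value under the null r ~ N(0, sigma^2 I).

    T is invariant to the scale of r, so sigma does not need to be known.
    """
    rng = np.random.default_rng(seed)
    observed = quadratic_ratio(theta, r)[0]
    null_stats = quadratic_ratio(theta, rng.standard_normal((n_sim, r.size)))
    return (1 + np.sum(null_stats >= observed)) / (n_sim + 1)

# Toy data; the RBF kernel stands in for the NTK Gram matrix (an assumption).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
theta = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)
noise = 0.3 * rng.standard_normal(x.size)

p_signal = monte_carlo_pvalue(theta, np.sin(3.0 * x) + noise)  # smooth signal present
p_null = monte_carlo_pvalue(theta, noise)                      # pure noise
print(f"p-value with signal: {p_signal:.4f}")
print(f"p-value, noise only: {p_null:.4f}")
```

The one-sided alternative reflects that $t$ can only move away from zero in the positive direction. For the objective $Q(t)$ discussed earlier, $Q'(0) < 0$ holds exactly when this ratio exceeds $\operatorname{tr}(\Theta)/n$, which is the sense in which the ratio probes whether any training helps.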
If the test rejects, it means there is structure beyond the random start, and then we go on to the REML rule to find $\hat t$. Thus the procedure is: 1) test $t=0$ vs $t>0$; 2) if training is warranted, train with early stopping at $\hat t_{\rm REML}$. The authors show this two-stage method controls false positives (type I error) for the test and has good power. In experiments, the score test triggers only when there’s a genuine signal, and the REML stopping time lands near the optimum of the learning curve.
Spectral Perspective and Theoretical Guarantees.
Reframing training as eigenfiltered inference gives nice intuition. The NTK’s eigenvalues often decay quickly: a few large eigenvalues capture the main structure, and the rest are small “noise” modes. Early training (small $t$) allows the large-eigenvalue directions to fit (since $1-e^{-t\lambda} \approx 1$ for large $\lambda$ even if $t$ is modest), while the small modes are still damped ($1-e^{-t\lambda}\approx t\lambda$ for tiny $\lambda$). Over time, the fit saturates the dominant modes and then gradually creeps into the minor modes. The REML rule essentially finds the $t$ where the contributions across modes are balanced.
Importantly, the authors prove that this REML-derived stopping time is asymptotically optimal in terms of prediction error. For a fixed training set (fixed design), they show that the in-sample prediction error at $\hat t_{\rm REML}$ is (in probability) essentially as low as the best possible error over any $t$. Symbolically, $E_n(\hat t_{\rm REML}) / \min_{t\ge0} E_n(t) \to 1$. Under extra regularity assumptions, a similar guarantee holds for out-of-sample/generalization error. Notably, these optimality results do not assume Gaussian noise (the REML approach still works well for non-Gaussian errors in practice). The upshot is that this early stopping criterion is not just heuristic – it provably matches what we would choose if we had oracle knowledge of the bias–variance tradeoff.
One can draw parallels to classic results: it’s well known that for linear regression with gradient descent, early stopping behaves much like ridge regression (L2) with penalty $\alpha \approx 1/(t\eta)$, where $\eta$ is the learning rate. Here the authors extend that spectral/dynamical analysis far beyond fixed linear models to the NTK kernel of a deep net. The “risky” (small eigenvalue) components are essentially being regularized by how long we train: short training heavily penalizes them (large implicit $\alpha$), long training relaxes the penalty.
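The correspondence is approximate rather than exact, which a quick numerical comparison of the two spectral filters makes visible. In this sketch the gradient-flow filter is $1 - e^{-t\lambda_i}$ and the ridge filter is $\lambda_i/(\lambda_i + \alpha)$ with $\alpha = 1/t$ (the learning rate is absorbed into $t$ in continuous time). The eigenvalue grid and function names are arbitrary choices of mine.

```python
import numpy as np

def gradient_flow_filter(eigvals, t):
    """Fraction of each eigen-direction fitted after gradient flow for time t."""
    return -np.expm1(-t * eigvals)              # 1 - exp(-t * lambda_i)

def ridge_filter(eigvals, alpha):
    """Fraction of each eigen-direction fitted by kernel ridge with penalty alpha."""
    return eigvals / (eigvals + alpha)

eigvals = np.logspace(-3, 2, 6)                 # assumed spectrum: 1e-3 ... 1e2
t = 2.0
gf = gradient_flow_filter(eigvals, t)
rd = ridge_filter(eigvals, 1.0 / t)

print("  lambda   grad-flow     ridge")
for lam, a, b in zip(eigvals, gf, rd):
    print(f"{lam:8.3f}  {a:10.4f}  {b:8.4f}")
print(f"largest gap between the two filters: {np.max(np.abs(gf - rd)):.3f}")
```

Both filters pass large-eigenvalue directions almost untouched and damp small ones roughly in proportion to $t\lambda_i$; they differ most in the transition region where $t\lambda_i$ is of order one.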
Related Work.
This paper sits at the intersection of deep learning and mixed-effects modeling. Previous work by Simchoni and Rosset (JMLR 2023) also brings random-effects ideas into deep nets, but in a different way. They focused on data correlations—e.g., spatial or batch effects—and treated certain hidden features as random effects, learning them alongside the net under a mixed-model likelihood. That approach is about modeling complex data dependencies. By contrast, Yao et al.’s contribution is about the training dynamics itself: even without any explicit correlated features, the gradient path can be seen as a random-effects inference. In that sense, the new work is complementary rather than duplicative. (Another related line of work at NeurIPS 2021 also uses random-effect layers to handle repeated measurements or categorical embeddings in networks.)
On the statistical side, the idea of testing whether a random-effect variance is zero (score tests) and selecting variance components by REML are classic techniques in biostatistics and econometrics. The novelty here is applying these ideas to neural network training. It builds on the recent trend of connecting over-parameterized deep learning with Gaussian processes and kernel methods. And while the NTK regime is an asymptotic perspective, it often provides qualitatively good guidance.
Experiments and Practical Implications.
Although the derivations assume infinite width and squared loss, the authors also validate the ideas empirically. In synthetic-data experiments, the variance-component test behaved as intended (correct Type I error when no signal, high power when there is a signal), and the REML stopping rule found training durations that gave near-optimal accuracy. Compared to the usual approach of holding out a validation set, the REML rule used all data for fitting and still matched or slightly outperformed validation-based early stopping, while needing no extra compute beyond solving the spectral equation.
As a real-world example, the paper applies the method to a UK Biobank proteomics dataset. There, both the test and the REML stop proved useful: the test avoided needless training on purely noisy label patterns, and when signal was present it found a stopping time that delivered competitive predictions. Overall, they report that the REML-guided approach saved a significant amount of computation (fewer epochs) with little to no loss in accuracy compared to standard tuning.
For a practitioner, this means two actionable takeaways. First, one could incorporate a quick score test after a bit of training (or even analytically) to decide if continuing training is worth the time. Second, if it is, one can steer clear of large validation sets: instead compute the REML criterion from the network’s NTK (or an approximation) to pick the stopping point. While doing that exactly may require kernel eigen-decompositions, in practice even approximate methods or early iterations might suffice, given the theoretical robustness. The key philosophical shift is treating the training end-point as an inferential parameter to estimate, rather than an arbitrary hyperparameter.
Discussion.
This work reframes deep learning training through the lens of classical statistics. By equating gradient flow with random-effects estimation, it bridges optimization and inference. One limitation is that the theory strictly relies on the NTK regime (very wide nets, small learning rate, etc.) and on regression-type losses. In practical finite networks or classification tasks, the equivalence is only approximate. Yet the qualitative insight – that training time regularizes toward fitting signal vs noise – certainly persists in many settings. Future work could test how well the REML stopping works for, say, neural nets trained on image data or with cross-entropy loss, and how robust the score test is under model misspecification.
Overall, this paper provides a new statistical toolkit for training decisions in deep learning. It reminds us that what looks like mere optimization trajectory can hide an implicit inference problem. By recognizing this, we gain principled ways to ask “Is more training actually improving our estimate of the true function? When have we squeezed out all the learnable structure?” In embodied AI and beyond, such questions are crucial: over-training can waste compute and encode noise, while under-training leaves valuable structure unexploited. Yao et al. give us a formal framework to make those decisions, backed by theory that this stopping rule is often the best one could hope for.
References: The main ideas and results here come from Yao et al., “Deep Neural Network Training as Random Effects: An Optimization-Inference Duality” (2026). For context, see Simchoni and Rosset (2023) on integrating random effects in DNNs. The empirical findings (score-test validity and REML performance) are also reported in Yao et al.