Can Neural Networks alone extrapolate or only interpolate?

 

Why do dense ReLU-activated neural networks, which in essence combine a lot of linear (weighted-sum) interpolation with only a few points of non-linearity (the "hinges" of ReLU), perform so well in practice? And is this limited but focused non-linearity a feature or a limitation, especially as the ratio of non-linearity to parameters decreases with layer width?
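
Before getting into the theory, here is a minimal numpy sketch of what "a few points of non-linearity" means concretely (the widths and random weights below are arbitrary illustration choices): restricted to a 1-D slice of its input, a dense ReLU network computes a piecewise-linear function, and its second differences are non-zero only at the ReLU hinges.

```python
# Minimal sketch: a randomly initialized dense ReLU network, evaluated along
# one input dimension, is piecewise linear; second differences vanish inside
# each linear piece and spike only at the ReLU "hinges".
import numpy as np

rng = np.random.default_rng(0)
widths = (1, 16, 16, 1)   # illustrative layer sizes, not from the post
params = [(rng.standard_normal((m, n)) / np.sqrt(m), 0.1 * rng.standard_normal(n))
          for m, n in zip(widths[:-1], widths[1:])]

def relu_mlp(x):
    """Evaluate the fixed random ReLU MLP on a vector of scalar inputs."""
    h = x.reshape(-1, 1)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:        # no ReLU on the output layer
            h = np.maximum(h, 0.0)
    return h.ravel()

x = np.linspace(-3.0, 3.0, 20001)
y = relu_mlp(x)

# Count grid points where the second difference is clearly non-zero.
kinks = int((np.abs(np.diff(y, 2)) > 1e-6).sum())
print(f"piecewise linear, with roughly {kinks} kinks along this 1-D slice")
```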

Here's what the current theory and empirical research reveal:


1. Universal Approximation and the Role of Non-linearity

By the Universal Approximation Theorem, even a network with just one hidden layer and a non-polynomial activation (like ReLU) can approximate any continuous function on a compact domain arbitrarily well, provided it is sufficiently wide (Wikipedia). However, this theorem is only an existence guarantee: it says nothing about how to train such a network, nor how many neurons you actually need in practice.

This shows that dense networks with ReLU do have the expressive capacity—but training them effectively (the actual path to those weights) is the real challenge.
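
As a rough illustration of that gap between capacity and training, here is a small PyTorch sketch (the target function, widths, optimizer, and training budget are arbitrary choices, not taken from any cited source): a single hidden ReLU layer fit to a smooth 1-D target at several widths. The theorem only promises that some width suffices; the error a given width actually reaches under gradient training is an empirical question.

```python
# Sketch: one hidden ReLU layer fit to a fixed continuous 1-D target at
# several widths, reporting the training error each width reaches.
import torch
import torch.nn as nn

def fit_one_hidden_layer(width, steps=4000, seed=0):
    torch.manual_seed(seed)
    x = torch.linspace(-3.0, 3.0, 512).unsqueeze(1)
    y = torch.sin(2.0 * x) + 0.3 * x              # any continuous target works
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for width in (4, 16, 64, 256):
    print(f"width {width:4d}: final train MSE {fit_one_hidden_layer(width):.2e}")
```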


2. Interpolation vs. Extrapolation: High Dimensions and Generalization

A key insight comes from the paper “Learning in High Dimension Always Amounts to Extrapolation” by Balestriero, Pesenti, and LeCun. They argue that in high-dimensional spaces (e.g., dimensions > 100), nearly all test samples lie outside the convex hull of the training data, meaning interpolation almost never occurs. Instead, what we call “interpolation” is actually extrapolation in the strict geometric sense (arXiv, Medium).
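
The convex-hull claim is easy to probe numerically. The sketch below (sample sizes, dimensions, and the plain Gaussian data are arbitrary choices, not the paper's setup) uses a small feasibility linear program to check how often a fresh sample falls inside the convex hull of a fixed training set as the dimension grows.

```python
# Rough check of the convex-hull claim: with the number of training points
# fixed, the fraction of fresh samples that land inside their convex hull
# collapses as the dimension grows.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def in_hull(points, p):
    """Feasibility LP: is p a convex combination of the rows of `points`?"""
    n = points.shape[0]
    A_eq = np.vstack([points.T, np.ones((1, n))])   # sum_i w_i x_i = p, sum_i w_i = 1
    b_eq = np.append(p, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.success

n_train, n_test = 500, 200
for d in (2, 5, 10, 20, 50):
    X = rng.standard_normal((n_train, d))
    hits = sum(in_hull(X, rng.standard_normal(d)) for _ in range(n_test))
    print(f"d={d:3d}: {hits / n_test:.0%} of fresh samples inside the training hull")
```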

This shifts the narrative dramatically: modern neural nets generalize despite operating primarily in an extrapolation regime, not because of interpolation. Reddit commenters reflect this shift:

“Yann LeCun thinks that it's specious to say neural network models are interpolating because in high dimensions, everything is extrapolation.” (Reddit)

So the widely cited weighted-sum interpolation picture may not really explain why these networks generalize.


3. Overparameterization, Smooth Interpolants, and Double Descent

Another angle is the behavior of overparameterized systems. A study on weighted trigonometric interpolation shows that:

  • Overparameterized models can achieve lower generalization error by favoring smooth interpolants, sometimes outperforming underparameterized models (arXiv).

This aligns with the well-known double-descent phenomenon: increasing parameters past the point of zero training error can further reduce test error, especially when the model is implicitly biased toward smoother functions.
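
Here is a generic sketch of that effect, not the cited paper's exact weighted trigonometric setup: minimum-norm least squares on random Fourier features, with the feature count swept through the interpolation threshold (all sizes, frequencies, and the noise level are arbitrary illustration choices). Test error typically peaks near n_features ≈ n_train and falls again in the heavily overparameterized regime.

```python
# Generic double-descent sketch: minimum-norm least squares on random
# Fourier features; the pseudoinverse gives the minimum-norm interpolant
# once n_feat >= n_train.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 40, 500, 0.1
target = lambda x: np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + noise * rng.standard_normal(n_train)
x_te = rng.uniform(-1, 1, n_test)
y_te = target(x_te)

def features(x, omegas, phases):
    return np.cos(np.outer(x, omegas) + phases) * np.sqrt(2.0 / len(omegas))

for n_feat in (10, 20, 40, 80, 160, 640):
    omegas = rng.normal(0.0, 8.0, n_feat)
    phases = rng.uniform(0.0, 2 * np.pi, n_feat)
    Phi_tr = features(x_tr, omegas, phases)
    Phi_te = features(x_te, omegas, phases)
    w = np.linalg.pinv(Phi_tr) @ y_tr            # minimum-norm least-squares fit
    mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"{n_feat:4d} features: test MSE {mse:.3f}")
```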


4. Extrapolation of Neural Networks: When Does It Work (and When Not)?

There's been research—both theoretical and empirical—into when neural nets extrapolate well:

  • One study shows that in the Neural Tangent Kernel (NTK) regime, two-layer ReLU networks trained on a linear target function can converge to the correct linear mapping, so they generalize (extrapolate) if the training distribution is “diverse enough” (researchain.net).

  • But in general, MLPs struggle to extrapolate non-linear functions, especially when the task deviates strongly from linearity (researchain.net); a small sketch contrasting the two cases follows this list.

  • In a more applied context (e.g., physics-informed networks), failure to extrapolate often arises not from high-frequency components per se, but from shifts in the spectral support of the data distribution (arXiv).

  • In sensor modeling and photonics, models built with PCA + DNN interpolate well within the training bounds but fail to extrapolate, sometimes doing worse than a pure DNN approach (American Chemical Society Publications).
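
The sketch below contrasts the first two bullets: the same two-layer ReLU MLP, trained only on inputs in [-1, 1]^d, is probed far outside that range on a linear and on a quadratic target. Architecture, input ranges, and targets are illustrative choices, not the setups used in the cited studies.

```python
# Sketch: a two-layer ReLU MLP trained inside [-1, 1]^d typically keeps
# tracking a linear target far outside that box, but not a quadratic one.
import torch
import torch.nn as nn

def fit_and_probe(target, d=4, width=256, steps=3000, seed=0):
    torch.manual_seed(seed)
    x_tr = 2 * torch.rand(2048, d) - 1            # training inputs in [-1, 1]^d
    x_far = 5 * (2 * torch.rand(2048, d) - 1)     # probe inputs in [-5, 5]^d
    net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x_tr), target(x_tr))
        loss.backward()
        opt.step()
    with torch.no_grad():
        far_mse = nn.functional.mse_loss(net(x_far), target(x_far)).item()
    return loss.item(), far_mse

linear    = lambda x: x.sum(dim=1, keepdim=True)
quadratic = lambda x: (x ** 2).sum(dim=1, keepdim=True)

for name, f in [("linear", linear), ("quadratic", quadratic)]:
    train_mse, far_mse = fit_and_probe(f)
    print(f"{name:9s} target: train MSE {train_mse:.3f}, far-field MSE {far_mse:.3f}")
```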

In sum, successful extrapolation depends heavily on the structure of the target function, the architecture, and how well the training distribution covers relevant variations.


5. The "Weighted-Sum Interpolation" Insight: Help or Hindrance?

Why it helps (in part):

  • Sparse, piecewise-linear approximations (courtesy of ReLU hinges) offer enough flexibility to approximate complex functions efficiently.

  • Overparameterization allows the network to fit training data while implicitly biasing toward smoother or simpler interpolants—even when extrapolating in practice.

Why it may be a limitation:

  • The limited non-linearity means networks struggle with behaviors that require more expressive extrapolation, such as periodic functions or functions with strong global structure not seen during training (a quick demonstration follows this list).

  • The extrapolation regime predominates, so relying solely on local, interpolation-like assumptions is insufficient; yet there is no guarantee the network can learn the correct behavior outside the training distribution.
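
As a quick demonstration of the periodicity point, the sketch below fits a ReLU MLP to sin(x) on [0, 4π] and then evaluates it on [4π, 8π]; outside the training range the fit stops oscillating and drifts toward a roughly affine continuation. Network size and training budget are arbitrary choices.

```python
# Sketch: a ReLU MLP fit to sin(x) in-range does not continue the
# oscillation outside the training interval.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x_in = torch.linspace(0.0, 4 * math.pi, 1024).unsqueeze(1)
x_out = torch.linspace(4 * math.pi, 8 * math.pi, 1024).unsqueeze(1)

net = nn.Sequential(nn.Linear(1, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(5000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x_in), torch.sin(x_in))
    loss.backward()
    opt.step()

with torch.no_grad():
    mse_in = nn.functional.mse_loss(net(x_in), torch.sin(x_in)).item()
    mse_out = nn.functional.mse_loss(net(x_out), torch.sin(x_out)).item()
print(f"MSE inside [0, 4*pi]: {mse_in:.4f}   MSE on [4*pi, 8*pi]: {mse_out:.4f}")
```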


Summary Table

Aspect | Benefit | Limitation
Universal Approximation | Dense ReLU nets can approximate any continuous function | Does not guarantee effective training or the size actually needed
Piecewise Linearity | Simple, efficient interpolation between learned "hinges" | May fail to extrapolate complex structure or periodicity
High-dimensional behavior | Generalization occurs even in extrapolation regimes | Interpolation assumptions are often invalid in high dimensions
Overparameterization bias | Encourages smoother solutions, improving generalization | Still limited under unseen data or distribution shifts
Extrapolation capability | Sometimes succeeds on linear targets with diverse training data | Often fails outside the training range, even for physics-informed networks

Final Thoughts

  • The success of dense ReLU networks is partially due to their capacity to do a lot with relatively simple piecewise linear interpolation—but this isn't the whole story.

  • Their success despite operating in extrapolation regimes (not interpolation) is a surprising finding that challenges traditional intuitions about generalization.

  • Overparameterization and the network’s implicit inductive biases (like smoothness) further help in generalizing—even in extrapolation territories.

  • Yet, true extrapolation remains fragile and heavily dependent on architecture, activation, training diversity, and whether training and test distributions align in more subtle ways.
