Double Descent and Extreme Learning Machines
*Figure: An Extreme Learning Machine with a fixed random layer followed by a linear output layer.*
What is "Double Descent" – Explained Simply
Recent research shows that something unexpected happens when you keep increasing model size past the point where it exactly fits the training data—this is called the interpolation threshold. Beyond this point:
- Test error starts to drop again.
- The model generalizes better, even though it's massively overparameterized.
This gives us a second "dip" in the test error curve—hence the name "double descent."
📊 How This Relates to Extreme Learning Machines (ELMs)
Let's now bring in Extreme Learning Machines, a type of neural network with:
- One hidden layer,
- Randomly chosen input-to-hidden weights (they are not trained),
- Only the output layer (a linear layer) learned, using something like the Moore-Penrose pseudoinverse or ridge regression.
Think of the ELM as a linear model applied to random features.
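As a concrete sketch, here is a minimal ELM in NumPy. The function names (`elm_fit`, `elm_predict`) and the toy sine-fitting task are illustrative, not a standard API: a random hidden layer is drawn once and frozen, and only the output weights are solved for with the Moore-Penrose pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden):
    # Random input-to-hidden weights: drawn once, never trained.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)           # random nonlinear features
    beta = np.linalg.pinv(H) @ y     # Moore-Penrose least-squares fit
    return W, b, beta

def elm_predict(X, W, b, beta):
    # Linear readout on top of the same fixed random features.
    return np.tanh(X @ W + b) @ beta

# Toy usage: learn y = sin(x) from noisy samples.
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)
W, b, beta = elm_fit(X, y, n_hidden=100)
pred = elm_predict(X, W, b, beta)
```

With more hidden units (100) than training points (40), the pseudoinverse solution interpolates the training data almost exactly, which is precisely the overparameterized regime discussed below.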
🧱 Number of Parameters and Double Descent in ELMs
Here’s how double descent happens in ELMs:
- **Few hidden units (few parameters):**
  - The model is too simple → it can't capture patterns well → high test error.
- **Hidden units ≈ the number of training samples:**
  - The model becomes just powerful enough to fit the training data exactly → this is the interpolation point.
  - But it also fits the noise → test error increases (the first rise).
- **Many more hidden units (lots of parameters):**
  - Surprisingly, test error drops again.
  - Why? Because the solution chosen (usually the minimum-norm or regularized one) tends to be smoother and generalizes better, even though the model is massively overparameterized.
So the error curve goes:
High → Low → High → Low
as you increase the number of hidden units in the ELM.
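That curve can be probed with a small experiment: sweep the number of hidden units below, near, and beyond the number of training samples and measure test error each time. This is a sketch on made-up toy data; the exact error values (and how sharp the peak near the interpolation point is) vary with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_test_error(n_hidden, X_tr, y_tr, X_te, y_te):
    # Fixed random hidden layer, min-norm least-squares readout.
    W = rng.normal(size=(X_tr.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H_tr = np.tanh(X_tr @ W + b)
    beta = np.linalg.pinv(H_tr) @ y_tr
    H_te = np.tanh(X_te @ W + b)
    return float(np.mean((H_te @ beta - y_te) ** 2))

n_train = 30
X_tr = rng.uniform(-3, 3, size=(n_train, 1))
y_tr = np.sin(X_tr).ravel() + 0.2 * rng.normal(size=n_train)
X_te = np.linspace(-3, 3, 200).reshape(-1, 1)
y_te = np.sin(X_te).ravel()

# Widths below, at, and well beyond the interpolation point (n_train = 30).
widths = [5, 15, 30, 100, 500]
errors = {k: elm_test_error(k, X_tr, y_tr, X_te, y_te) for k in widths}
```

Plotting `errors` against `widths` (averaged over several seeds to smooth out randomness) typically reproduces the High → Low → High → Low shape described above, with the peak near 30 hidden units.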
🔑 Key Idea
Even though ELMs are simple and don’t train their hidden layers, they still show double descent because:
- The output layer is a linear model, applied to a growing number of features (random nonlinear transforms).
- As the number of hidden units increases, the final linear layer effectively has more input features.
- This behaves just like a linear regression model with increasing dimensionality, which is known to show double descent.
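One way to see why the minimum-norm solution matters here: once there are more features than samples, infinitely many weight vectors fit the data exactly, and the pseudoinverse picks the one with the smallest norm. A small sketch with purely illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined least squares: 50 features, only 10 samples.
H = rng.normal(size=(10, 50))
y = rng.normal(size=10)

# The pseudoinverse returns the minimum-norm interpolating solution.
beta = np.linalg.pinv(H) @ y

# Build another interpolating solution by adding a null-space
# component of H (rows of Vt beyond the rank span the null space).
_, _, Vt = np.linalg.svd(H)
null_basis = Vt[10:].T               # H @ null_basis ≈ 0
beta_other = beta + null_basis @ rng.normal(size=40)
```

Both `beta` and `beta_other` fit the training data perfectly, but `beta_other` has a strictly larger norm; the minimum-norm choice is the "smoother" solution credited with the second descent.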
📘 In One Sentence
Double descent happens in ELMs because increasing the number of hidden units takes the model from underfitting (moderate sensitivity to input noise) → fitting the training data perfectly (high sensitivity to input noise) → better generalization again (low sensitivity to input noise), just like in modern deep learning, but through a linear model on random features.
📚 Want to Read More?
You can check out:
- "Reconciling modern machine-learning practice and the classical bias–variance trade-off" – Belkin et al. (2019)
- "Double Descent in Random Feature Models and ELMs" – various 2020–2023 ML workshop papers
- ELM theory from G.-B. Huang et al., which often discusses generalization in overparameterized settings.