Using Extreme Learning Machines as a Data Sponge
1. Extreme Learning Machines (ELMs) as the “sponge”
ELMs (and similar fixed-feature random projection networks) already have these sponge properties:
- Fast training — weights in the hidden layer are fixed random projections; only the output weights are learned (often by linear regression).
- Good interpolation in the training region, if the random features span the relevant subspace.
- Low complexity — less tuning, smaller training cost.
If your “sponge” role really is just to store associative patterns and map them into a useful feature space, then an ELM (or random Fourier features, or kernel approximations) can do that far more cheaply than a full backprop-trained deep net.
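The core recipe can be sketched in a few lines of NumPy: a frozen random hidden layer acts as the sponge, and only the linear readout is fitted, in closed form via ridge regression. The task, sizes, and hyperparameters here are illustrative choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: learn y = sin(3x) on [-1, 1].
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])

# The "sponge": a fixed random hidden layer. These weights are never trained.
n_hidden = 300
W = rng.normal(scale=2.0, size=(1, n_hidden))   # random input weights
b = rng.uniform(-1, 1, size=n_hidden)           # random biases
H = np.tanh(X @ W + b)                          # hidden activations

# Only the output weights are learned, by ridge regression (closed form).
lam = 1e-3
beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

# Predictions inside the training region interpolate well.
X_test = np.linspace(-1, 1, 50).reshape(-1, 1)
y_pred = np.tanh(X_test @ W + b) @ beta
print(np.max(np.abs(y_pred - np.sin(3 * X_test[:, 0]))))
```

Note that the only “training” is one linear solve — no gradients, no epochs — which is where the cheapness comes from.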
The trade-off:
- The feature mapping is fixed, so you lose the adaptive representation learning that deep nets excel at.
- You may need a lot of random features to match the coverage of a trained net, which can inflate memory.
- Without adaptation, robustness to domain shift can be worse unless the random mapping is very high-dimensional.
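The feature-count point can be made concrete with random Fourier features (the Rahimi–Recht construction for the RBF kernel): the kernel-approximation error shrinks only on the order of 1/sqrt(D), so matching a learned representation's coverage can require a large D and correspondingly more memory. This is a sketch with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, D, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel
    k(x, z) = exp(-gamma * ||x - z||^2)."""
    r = np.random.default_rng(seed)
    d = X.shape[1]
    W = r.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = r.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.normal(size=(100, 5))
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

errs = {}
for D in (100, 1000, 10000):
    Z = rff(X, D, seed=D)
    errs[D] = np.abs(Z @ Z.T - K_exact).max()
    print(D, errs[D])  # error shrinks as D grows
```

A 100x improvement in approximation error costs roughly 10,000x the features — that is the memory inflation in miniature.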
2. Engineering extrapolation into the system
If you swap a dense ReLU net for an ELM, the extrapolation still has to come from somewhere. Options include:
- External retrieval/memory (RAG, vector databases) — doesn’t require backprop through the memory.
- Symbolic or algorithmic modules — e.g., planners, calculators, simulators.
- Learned controllers or routers that select modules — these can be trained without full backprop into the sponge, using reinforcement learning or policy gradients.
- Feature-space composition — e.g., combining ELM features with structured representations (graphs, trees) before feeding into downstream algorithms.
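The first option is the simplest to see in code: external retrieval can be plain nearest-neighbor lookup over stored key vectors (which could be ELM features of stored items), and nothing in it needs to be differentiable. A minimal sketch with made-up keys and payloads:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical memory: unit-norm key vectors paired with arbitrary
# payloads. No part of this lookup is differentiable.
keys = rng.normal(size=(1000, 64))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
payloads = [f"item-{i}" for i in range(1000)]

def retrieve(query, k=3):
    """Return the k payloads whose keys are most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    scores = keys @ q                 # cosine similarity against all keys
    top = np.argsort(scores)[::-1][:k]
    return [payloads[i] for i in top]

# A query near key 42 should surface item-42 first.
print(retrieve(keys[42] + 0.01 * rng.normal(size=64)))
```

Real systems replace the brute-force scan with an approximate index (e.g. a vector database), but the interface — query in, payloads out, no gradients — is the same.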
3. When backprop through the whole system matters
You’re right — in many state-of-the-art systems, the engineered modules are differentiable and trained end-to-end:
- Vision–language models: image encoder + text decoder + cross-attention are all trained together.
- Tool-augmented LLMs with differentiable retrievers: the retriever is fine-tuned via gradients from the language loss.
- Reinforcement learning with differentiable environments (rare, but exists).
End-to-end backprop allows the neural core to co-adapt its representation to the needs of the surrounding modules, which often improves efficiency and sample complexity.
If you replace the core with a fixed-feature ELM, you lose this co-adaptation — the surrounding modules have to work with whatever feature space the ELM gives them.
4. Hybrid compromise
One possible hybrid is:
- Cheap, partially trainable sponge:
  - Randomized features for most of the projection
  - A small trainable bottleneck layer for adaptation
- Non-differentiable extrapolation modules:
  - Retrieval, simulation, symbolic reasoning
- Occasional fine-tuning of the bottleneck or output weights using a meta-learning loop, rather than continuous backprop through everything.
This keeps the sponge inexpensive, but lets the engineered shell do the heavy extrapolation lifting.
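A minimal sketch of the cheap-sponge-plus-occasional-fine-tuning pattern, under assumed toy conditions (a task that drifts slightly; the bottleneck is omitted and only the output weights are fine-tuned, for brevity): the random projection is fitted once in closed form, then adapted with a handful of gradient steps when new data arrives.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(X, W, b):
    """Frozen random projection -- the cheap 'sponge' part."""
    return np.tanh(X @ W + b)

# Sponge: fixed random projection, never trained.
W = rng.normal(size=(1, 200))
b = rng.uniform(-1, 1, size=200)

# Initial fit: closed-form ridge regression on task A (y = sin(2x)).
X_a = rng.uniform(-1, 1, size=(300, 1))
y_a = np.sin(2 * X_a[:, 0])
H = features(X_a, W, b)
beta = np.linalg.solve(H.T @ H + 1e-3 * np.eye(200), H.T @ y_a)

# Later, the task drifts to sin(2x + 0.3). Instead of retraining
# everything, take a few gradient steps on the output weights only.
X_b = rng.uniform(-1, 1, size=(300, 1))
y_b = np.sin(2 * X_b[:, 0] + 0.3)
H_b = features(X_b, W, b)
mse_before = np.mean((H_b @ beta - y_b) ** 2)

# Step size chosen from the largest eigenvalue to keep plain GD stable.
lr = 1.0 / np.linalg.eigvalsh(H_b.T @ H_b / len(y_b))[-1]
for _ in range(200):
    grad = H_b.T @ (H_b @ beta - y_b) / len(y_b)
    beta -= lr * grad
mse_after = np.mean((H_b @ beta - y_b) ** 2)
print(mse_before, mse_after)  # fine-tuning lowers error on the drifted task
```

The expensive parts (the projection, and in a fuller version the extrapolation modules around it) stay untouched; only a small linear head moves, and only when needed.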