Random Features in Machine Learning and Neural Networks

1. Early ideas: randomness in function approximation (1960s–1990s)
Before “random features” was a buzzword, researchers were already using randomness in basis functions to approximate complex functions efficiently.
- Random kitchen sinks before they had a name: In signal processing and approximation theory, you can represent a function as a linear combination of basis functions (Fourier, wavelets, etc.). The idea of choosing some basis functions randomly was explored for Monte Carlo integration and statistical estimation.
- Random projections in statistics: Johnson & Lindenstrauss (1984) showed that you can project high-dimensional data into a much lower dimension with a random linear mapping while preserving pairwise distances up to small error (the JL lemma). This became the theoretical bedrock for a lot of random feature work.
- Neural networks: In the late ’80s–early ’90s, there was a wave of research on random weights in the first layer with only the output layer trained, sometimes called "fixed random basis" networks. The motivation echoes the universal approximation theorem: given enough random hidden units, you can approximate a continuous function well by adjusting only the output weights.
2. The kernel connection: Rahimi & Recht (2007–2008)
The modern term “random features” comes from Ali Rahimi & Ben Recht and their “Random Kitchen Sinks” papers:
- Random Fourier Features (NIPS 2007, “Random Features for Large-Scale Kernel Machines”): any shift-invariant kernel k(x, y) = k(x − y) can be approximated by an explicit low-dimensional feature map whose frequencies are sampled at random (via Bochner’s theorem), so a linear model on the random features mimics a kernel machine.
- Random Kitchen Sinks (NIPS 2008, “Weighted Sums of Random Kitchen Sinks”): the same idea generalized beyond Fourier bases, with guarantees for fitting only the output weights of randomly drawn features.
This connected random projections directly to kernel approximation and opened a floodgate of research.
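To make the construction concrete, here is a minimal NumPy sketch of random Fourier features for the Gaussian/RBF kernel. The function name `rff_features` and the bandwidth `sigma` are illustrative choices, not anything from the original papers:

```python
import numpy as np

def rff_features(X, n_features=500, sigma=1.0, seed=0):
    """Random Fourier features z(x) with z(x) @ z(y) ~= exp(-||x - y||^2 / (2*sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Bochner's theorem: for the Gaussian kernel, the random frequencies are Gaussian.
    W = rng.standard_normal((n_features, d)) / sigma
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)

# Sanity check: the random-feature dot product tracks the exact RBF kernel.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))
Z = rff_features(X, n_features=2000)
approx = Z @ Z.T
exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())   # small, and it shrinks as n_features grows
```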
3. Locality Sensitive Hashing (LSH) as random features
- LSH origin: Developed in the late ’90s (Indyk & Motwani, 1998) for sublinear-time nearest neighbor search.
- How it works: Hash functions are chosen so that nearby points collide in the same hash bucket with high probability (e.g., for cosine similarity, random hyperplanes are used).
- View as random features: Each LSH bit can be treated as a binary random feature. For example, h(x) = sign(wᵀx) with w random Gaussian is essentially a very coarse random projection followed by quantization (a minimal code sketch follows this list).
- Kernel link: The probability that two vectors share the same LSH hash is a function of their similarity; thus, LSH implicitly defines a similarity kernel. Using the LSH outputs as features is like approximating that kernel.
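A minimal sketch of the random-hyperplane (SimHash) case, assuming the sign(wᵀx) hash above. Per bit, the collision probability is 1 − θ/π, where θ is the angle between the two vectors, and the empirical rate should match:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 50, 4096
W = rng.standard_normal((n_bits, d))        # one random hyperplane per bit

def lsh_bits(x):
    # Each bit is a binary random feature: which side of the hyperplane x falls on.
    return (W @ x >= 0)

x = rng.standard_normal(d)
y = x + 0.5 * rng.standard_normal(d)        # a nearby vector
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
collision_rate = np.mean(lsh_bits(x) == lsh_bits(y))
print(collision_rate, 1 - theta / np.pi)    # the two numbers should be close
```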
4. Random projections as a special case of random features
Drop the nonlinearity entirely and use z(x) = Wx with a random matrix W (as in the JL lemma) and you recover plain random projection: the implicit kernel being approximated is just the linear kernel ⟨x, y⟩, since E[(Wx)ᵀ(Wy)] = xᵀy for suitably scaled W.
5. Random features in neural networks
Several strands emerged:
- Extreme Learning Machines (ELM) (Huang et al., 2006) — single hidden layer NN with random weights; only the output layer is trained (see the sketch after this list).
- Reservoir Computing (Jaeger, 2001; Maass et al., 2002) — large, fixed, randomly connected recurrent network; train only the readout.
- Random convolutional features — randomly initialized convolution filters (often untrained) can still give surprisingly good representations if followed by pooling and a trained linear layer.
- Neural Tangent Kernel (NTK) viewpoint — as width → ∞, a randomly initialized network behaves like a fixed kernel machine: training only the last layer is kernel regression with the network’s random-features (conjugate) kernel, while training all layers by gradient descent corresponds to regression with the NTK.
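As a concrete illustration of the ELM recipe mentioned above, a minimal NumPy sketch; the function names, layer size, and ridge parameter `lam` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=200, lam=1e-2):
    """Random hidden layer is frozen; only the linear readout is solved for."""
    d = X.shape[1]
    W = rng.standard_normal((n_hidden, d))      # random, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W.T + b)                    # hidden activations
    # Ridge regression for the output weights (the only "training" step).
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W.T + b) @ beta

# Toy regression: learn y = sin(x) from noisy samples.
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
params = elm_fit(X, y)
print(np.mean((elm_predict(X, *params) - y) ** 2))   # small training error
```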
6. Structured & fast random features
A bottleneck in early work was computing the projection Wx for a dense (typically Gaussian) random matrix W. Solutions:
- Fastfood transform (Le et al., 2013) — replace the dense Gaussian matrix with products of Hadamard, diagonal, and permutation matrices to speed up the projection to O(d log d).
- Orthogonal random features (Yu et al., 2016) — make the rows of W exactly orthogonal, which lowers the variance of the kernel estimate and improves approximation quality for the same number of features (a sketch follows this list).
- Circulant and convolutional projections — leverage FFT for speed.
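A rough sketch of the orthogonal-random-features construction for the same RBF setup as before. The function name is an assumption, and this version handles only a single orthogonal block (so it requires n_features ≤ input dimension); practical implementations stack several blocks:

```python
import numpy as np

def orthogonal_random_features(X, n_features, sigma=1.0, seed=0):
    """Sketch of orthogonal random features for the Gaussian/RBF kernel (single block)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    assert n_features <= d, "stack several orthogonal blocks when n_features > d"
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    G = rng.standard_normal((d, d))
    Q, _ = np.linalg.qr(G)
    # Rescale rows so their norms match those of i.i.d. Gaussian rows
    # (chi-distributed with d degrees of freedom).
    S = np.linalg.norm(rng.standard_normal((n_features, d)), axis=1)
    W = (S[:, None] * Q[:n_features]) / sigma
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)
```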
7. Applications & uses
- Kernel approximation — SVMs, Gaussian processes, spectral clustering.
- Approximate nearest neighbor search — LSH-based features.
- Dimensionality reduction — pre-processing step for large datasets.
- Streaming / low-memory learning — fixed-size random feature maps allow online learning without storing large kernel matrices (a short sketch follows this list).
- Privacy — random projections can obfuscate individual coordinates while retaining distances.
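As referenced in the streaming bullet above, here is a minimal sketch of online learning on a frozen random Fourier feature map; the simulated stream, learning rate, and target function are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 20, 500                     # input dim, number of random features
W = rng.standard_normal((D, d))    # frozen random frequencies (RBF, sigma = 1)
b = rng.uniform(0, 2 * np.pi, D)
theta = np.zeros(D)                # linear model on top of the features
lr = 0.05

def features(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

# Simulated stream: y depends nonlinearly on x; memory stays O(D) throughout.
for t in range(10_000):
    x = rng.standard_normal(d)
    y = np.sin(x[0]) + 0.1 * rng.standard_normal()
    z = features(x)
    err = z @ theta - y
    theta -= lr * err * z          # one SGD step; no kernel matrix is ever stored
```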
8. Useful conceptual connections
- Compressed sensing — uses random projections to recover sparse signals; similar mathematics to random features but for reconstruction, not classification.
- Johnson–Lindenstrauss lemma — the theoretical guarantee for distance-preserving random mappings.
- Sketching in numerical linear algebra — random projections for matrix approximation; directly related to the same math.
- Hash kernels in NLP — “feature hashing” maps high-dimensional sparse features into a lower-dimensional space via pseudo-random hash functions, similar in spirit to LSH as features.
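A minimal sketch of the feature-hashing idea, assuming bag-of-words tokens as input; Python's built-in `hash` stands in for the pseudo-random hash functions (production systems typically use a stable hash such as MurmurHash):

```python
import numpy as np

def hash_features(tokens, n_buckets=1024):
    """Map a bag of tokens to a fixed-size vector via a bucket hash and a sign hash."""
    v = np.zeros(n_buckets)
    for tok in tokens:
        idx = hash(tok) % n_buckets                   # which bucket the token lands in
        sign = 1 if hash(tok + "#sign") % 2 else -1   # sign hash keeps inner products unbiased
        v[idx] += sign
    return v

a = hash_features("the quick brown fox".split())
b = hash_features("the quick red fox".split())
print(a @ b)   # approximates the token overlap between the two bags of words
```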