Random Features in Machine Learning and Neural Networks

1. Early ideas: randomness in function approximation (1960s–1990s)
Before “random features” was a buzzword, researchers were already using randomness in basis functions to approximate complex functions efficiently.
- Random kitchen sinks before they had a name: In signal processing and approximation theory, you can represent a function as a linear combination of basis functions (Fourier, wavelets, etc.). The idea of choosing some basis functions randomly was explored for Monte Carlo integration and statistical estimation.
- Random projections in statistics: Johnson & Lindenstrauss (1984) showed that you can project high-dimensional data into a much lower dimension with a random linear mapping while preserving pairwise distances up to small error (the JL lemma). This became the theoretical bedrock for a lot of random feature work.
- Neural networks: In the late ’80s–early ’90s, there was a wave of research on random weights in the first layer with only the output layer trained, sometimes called "fixed random basis" networks. The motivation echoes the universal approximation theorem: given enough random hidden units, you can approximate a continuous function well by adjusting only the output weights.
2. The kernel connection: Rahimi & Recht (2007–2008)
The modern term “random features” comes from Ali Rahimi & Ben Recht and their “Random Kitchen Sinks” papers:
- Random Fourier Features (NIPS 2007, “Random Features for Large-Scale Kernel Machines”): any shift-invariant kernel k(x, y) = k(x − y) can be approximated by an explicit low-dimensional feature map whose frequencies are sampled at random (via Bochner’s theorem), so a linear model on the random features mimics a kernel machine.
- Random Kitchen Sinks (NIPS 2008, “Weighted Sums of Random Kitchen Sinks”): the same idea generalized beyond Fourier bases, with guarantees for fitting only the output weights of randomly drawn features.
This connected random projections directly to kernel approximation and opened a floodgate of research.
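To make the construction concrete, here is a minimal NumPy sketch of random Fourier features for the Gaussian/RBF kernel. The function name `rff_features` and the bandwidth `sigma` are illustrative choices, not anything from the original papers:

```python
import numpy as np

def rff_features(X, n_features=500, sigma=1.0, seed=0):
    """Random Fourier features z(x) with z(x) @ z(y) ~= exp(-||x - y||^2 / (2*sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Bochner's theorem: for the Gaussian kernel, the random frequencies are Gaussian.
    W = rng.standard_normal((n_features, d)) / sigma
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)

# Sanity check: the random-feature dot product tracks the exact RBF kernel.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 10))
Z = rff_features(X, n_features=2000)
approx = Z @ Z.T
exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(approx - exact).max())   # small, and it shrinks as n_features grows
```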
3. Locality Sensitive Hashing (LSH) as random features
- LSH origin: Developed in the late ’90s (Indyk & Motwani, 1998) for sublinear-time nearest neighbor search.
- How it works: Hash functions are chosen so that nearby points collide in the same hash bucket with high probability (e.g., for cosine similarity, random hyperplanes are used).
- View as random features: Each LSH bit can be treated as a binary random feature. For example, h(x) = sign(wᵀx) with w random Gaussian is essentially a very coarse random projection followed by quantization (a minimal code sketch follows this list).
- Kernel link: The probability that two vectors share the same LSH hash is a function of their similarity; thus, LSH implicitly defines a similarity kernel. Using the LSH outputs as features is like approximating that kernel.
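A minimal sketch of the random-hyperplane (SimHash) case, assuming the sign(wᵀx) hash above. Per bit, the collision probability is 1 − θ/π, where θ is the angle between the two vectors, and the empirical rate should match:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 50, 4096
W = rng.standard_normal((n_bits, d))        # one random hyperplane per bit

def lsh_bits(x):
    # Each bit is a binary random feature: which side of the hyperplane x falls on.
    return (W @ x >= 0)

x = rng.standard_normal(d)
y = x + 0.5 * rng.standard_normal(d)        # a nearby vector
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
collision_rate = np.mean(lsh_bits(x) == lsh_bits(y))
print(collision_rate, 1 - theta / np.pi)    # the two numbers should be close
```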
4. Random projections as a special case of random features
Drop the nonlinearity entirely and use z(x) = Wx with a random matrix W (as in the JL lemma) and you recover plain random projection: the implicit kernel being approximated is just the linear kernel ⟨x, y⟩, since E[(Wx)ᵀ(Wy)] = xᵀy for suitably scaled W.
5. Random features in neural networks
Several strands emerged:
- Extreme Learning Machines (ELM) (Huang et al., 2006) — single hidden layer NN with random weights; only the output layer is trained (see the sketch after this list).
- Reservoir Computing (Jaeger, 2001; Maass et al., 2002) — large, fixed, randomly connected recurrent network; train only the readout.
- Random convolutional features — randomly initialized convolution filters (often untrained) can still give surprisingly good representations if followed by pooling and a trained linear layer.
- Neural Tangent Kernel (NTK) viewpoint — as width → ∞, a randomly initialized network behaves like a fixed kernel machine: training only the last layer is kernel regression with the network’s random-features (conjugate) kernel, while training all layers by gradient descent corresponds to regression with the NTK.
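As a concrete illustration of the ELM recipe mentioned above, a minimal NumPy sketch; the function names, layer size, and ridge parameter `lam` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=200, lam=1e-2):
    """Random hidden layer is frozen; only the linear readout is solved for."""
    d = X.shape[1]
    W = rng.standard_normal((n_hidden, d))      # random, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W.T + b)                    # hidden activations
    # Ridge regression for the output weights (the only "training" step).
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W.T + b) @ beta

# Toy regression: learn y = sin(x) from noisy samples.
X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
params = elm_fit(X, y)
print(np.mean((elm_predict(X, *params) - y) ** 2))   # small training error
```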
6. Structured & fast random features
A bottleneck in early work was computing the projection Wx for a dense (typically Gaussian) random matrix W. Solutions:
- Fastfood transform (Le et al., 2013) — replace the dense Gaussian matrix with products of Hadamard, diagonal, and permutation matrices to speed up the projection to O(d log d).
- Orthogonal random features (Yu et al., 2016) — make the rows of W exactly orthogonal, which lowers the variance of the kernel estimate and improves approximation quality for the same number of features (a sketch follows this list).
- Circulant and convolutional projections — leverage FFT for speed.
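A rough sketch of the orthogonal-random-features construction for the same RBF setup as before. The function name is an assumption, and this version handles only a single orthogonal block (so it requires n_features ≤ input dimension); practical implementations stack several blocks:

```python
import numpy as np

def orthogonal_random_features(X, n_features, sigma=1.0, seed=0):
    """Sketch of orthogonal random features for the Gaussian/RBF kernel (single block)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    assert n_features <= d, "stack several orthogonal blocks when n_features > d"
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    G = rng.standard_normal((d, d))
    Q, _ = np.linalg.qr(G)
    # Rescale rows so their norms match those of i.i.d. Gaussian rows
    # (chi-distributed with d degrees of freedom).
    S = np.linalg.norm(rng.standard_normal((n_features, d)), axis=1)
    W = (S[:, None] * Q[:n_features]) / sigma
    b = rng.uniform(0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W.T + b)
```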
7. Applications & uses
- Kernel approximation — SVMs, Gaussian processes, spectral clustering.
- Approximate nearest neighbor search — LSH-based features.
- Dimensionality reduction — pre-processing step for large datasets.
- Streaming / low-memory learning — fixed-size random feature maps allow online learning without storing large kernel matrices (a short sketch follows this list).
- Privacy — random projections can obfuscate individual coordinates while retaining distances.
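As referenced in the streaming bullet above, here is a minimal sketch of online learning on a frozen random Fourier feature map; the simulated stream, learning rate, and target function are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 20, 500                     # input dim, number of random features
W = rng.standard_normal((D, d))    # frozen random frequencies (RBF, sigma = 1)
b = rng.uniform(0, 2 * np.pi, D)
theta = np.zeros(D)                # linear model on top of the features
lr = 0.05

def features(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

# Simulated stream: y depends nonlinearly on x; memory stays O(D) throughout.
for t in range(10_000):
    x = rng.standard_normal(d)
    y = np.sin(x[0]) + 0.1 * rng.standard_normal()
    z = features(x)
    err = z @ theta - y
    theta -= lr * err * z          # one SGD step; no kernel matrix is ever stored
```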
8. Useful conceptual connections
- Compressed sensing — uses random projections to recover sparse signals; similar mathematics to random features but for reconstruction, not classification.
- Johnson–Lindenstrauss lemma — the theoretical guarantee for distance-preserving random mappings.
- Sketching in numerical linear algebra — random projections for matrix approximation; directly related to the same math.
- Hash kernels in NLP — “feature hashing” maps high-dimensional sparse features into a lower-dimensional space via pseudo-random hash functions, similar in spirit to LSH as features.
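A minimal sketch of the feature-hashing idea, assuming bag-of-words tokens as input; Python's built-in `hash` stands in for the pseudo-random hash functions (production systems typically use a stable hash such as MurmurHash):

```python
import numpy as np

def hash_features(tokens, n_buckets=1024):
    """Map a bag of tokens to a fixed-size vector via a bucket hash and a sign hash."""
    v = np.zeros(n_buckets)
    for tok in tokens:
        idx = hash(tok) % n_buckets                   # which bucket the token lands in
        sign = 1 if hash(tok + "#sign") % 2 else -1   # sign hash keeps inner products unbiased
        v[idx] += sign
    return v

a = hash_features("the quick brown fox".split())
b = hash_features("the quick red fox".split())
print(a @ b)   # approximates the token overlap between the two bags of words
```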