Random Features in Machine Learning and Neural Networks

 



1. Early ideas: randomness in function approximation (1960s–1990s)

Before “random features” was a buzzword, researchers were already using randomness in basis functions to approximate complex functions efficiently.

  • Random kitchen sinks before they had a name: In signal processing and approximation theory, you can represent a function as a linear combination of basis functions (Fourier, wavelets, etc.). The idea of choosing some basis functions randomly was explored for Monte Carlo integration and statistical estimation.

  • Random projections in statistics: Johnson & Lindenstrauss (1984) showed that you can project high-dimensional data into a much lower dimension with a random linear mapping while preserving pairwise distances up to small error (the JL lemma). This became the theoretical bedrock for a lot of random feature work.

  • Neural networks: In the late ’80s–early ’90s, there was a wave of research on networks with random first-layer weights and only the output layer trained — sometimes called "fixed random basis" networks. The motivation was universal-approximation-style results: with enough random hidden units, a continuous function can be approximated arbitrarily well by adjusting only the output weights.


2. The kernel connection: Rahimi & Recht (2007–2008)

The modern term “random features” comes from Ali Rahimi & Ben Recht and their “Random Kitchen Sinks” papers (2007–2008):

  • Key idea: for a shift-invariant kernel k(x, y) = k(x − y), Bochner’s theorem says the kernel is the Fourier transform of a probability distribution over frequencies. Sample frequencies w from that distribution, build explicit features such as z(x) = cos(w·x + b), and the inner products of these random features approximate the kernel — so a linear model trained on z(x) behaves like a kernel machine.

  • This connected random projections directly to kernel approximation and opened the floodgates for follow-up research.
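A minimal sketch of that construction for the Gaussian (RBF) kernel, assuming NumPy; the feature count D and bandwidth gamma below are illustrative choices, not values from the papers:

    import numpy as np

    def rbf_random_features(X, D=500, gamma=1.0, seed=0):
        """Random Fourier features: z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Frequencies drawn from the Fourier transform of the RBF kernel (a Gaussian),
        # phases drawn uniformly -- the Rahimi-Recht recipe for shift-invariant kernels.
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
        b = rng.uniform(0, 2 * np.pi, size=D)
        return np.sqrt(2.0 / D) * np.cos(X @ W + b)

    # Compare the random-feature inner products with the exact RBF kernel.
    X = np.random.default_rng(1).normal(size=(5, 10))
    Z = rbf_random_features(X, D=2000)
    approx = Z @ Z.T
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    exact = np.exp(-1.0 * sq_dists)
    print(np.abs(approx - exact).max())    # shrinks as D grows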


3. Locality Sensitive Hashing (LSH) as random features

  • LSH origin: Developed in the late ’90s (Indyk & Motwani, 1998) for sublinear-time nearest neighbor search.

  • How it works: Hash functions are chosen so that nearby points collide in the same hash bucket with high probability (e.g., for cosine similarity, random hyperplanes are used).

  • View as random features: each LSH bit can be treated as a binary random feature. For example, for cosine similarity,

        h(x) = sign(w · x)

    with w a random Gaussian vector — this is essentially a very coarse random projection followed by quantization (see the sketch after this list).

  • Kernel link: The probability that two vectors share the same LSH hash is a function of their similarity; thus, LSH implicitly defines a similarity kernel. Using the LSH outputs as features is like approximating that kernel.
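A small sketch of the random-hyperplane (SimHash-style) view described above, assuming NumPy; the bit count and toy data are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def simhash_bits(X, n_bits=64):
        """Each bit is sign(w . x) for a random Gaussian hyperplane w -- a binary random feature."""
        W = rng.normal(size=(X.shape[1], n_bits))
        return (X @ W >= 0).astype(np.uint8)

    x = rng.normal(size=20)
    y = x + 0.1 * rng.normal(size=20)           # a nearby point
    bx, by = simhash_bits(np.vstack([x, y]))    # hash both points with the same hyperplanes

    # The fraction of matching bits estimates 1 - theta/pi, where theta is the angle
    # between x and y -- the similarity kernel implicitly defined by this hash family.
    match = (bx == by).mean()
    theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    print(match, 1 - theta / np.pi)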


4. Random projections as a special case of random features

A plain random linear map z(x) = Rx (with R a k × d matrix of i.i.d. Gaussian entries, scaled by 1/√k) is the simplest random feature map: by the Johnson–Lindenstrauss lemma it approximately preserves distances and inner products, so it amounts to approximating the linear kernel. The constructions above just add one more step after the projection — a cosine (Rahimi–Recht) to reach nonlinear shift-invariant kernels, or a sign/quantization (LSH) to get binary codes.
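A quick numerical check of this distance-preserving behavior, assuming NumPy; the sizes n, d, k are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 200, 10_000, 300              # illustrative sizes
    X = rng.normal(size=(n, d))

    # JL-style map: a scaled Gaussian matrix approximately preserves pairwise distances.
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    Y = X @ R

    def pairwise_sq_dists(A):
        sq = (A ** 2).sum(axis=1)
        return sq[:, None] + sq[None, :] - 2 * A @ A.T

    i, j = np.triu_indices(n, 1)
    ratio = pairwise_sq_dists(Y)[i, j] / pairwise_sq_dists(X)[i, j]
    print(ratio.min(), ratio.max())          # clustered near 1 for moderate k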

5. Random features in neural networks

Several strands emerged:

  1. Extreme Learning Machines (ELM) (Huang et al., 2006) — a single-hidden-layer network whose hidden weights are random and fixed; only the output layer is trained (a minimal sketch follows this list).

  2. Reservoir Computing (Jaeger, 2001; Maass, 2002) — large, fixed, randomly connected recurrent network; train only the readout.

  3. Random convolutional features — randomly initialized convolution filters (often untrained) can still give surprisingly good representations if followed by pooling and a trained linear layer.

  4. Neural Tangent Kernel (NTK) / infinite-width viewpoint — as width → ∞, a randomly initialized network induces a fixed kernel; training only the last layer is kernel regression with that random-features (NNGP/“conjugate”) kernel, while the NTK of Jacot et al. (2018) describes training all the layers in the same limit.
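A minimal sketch of strand 1 (the ELM recipe: fixed random hidden layer, closed-form ridge readout), assuming NumPy; the hidden width, activation, and regularization value are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def elm_fit(X, y, n_hidden=200, reg=1e-3):
        """Fixed random hidden layer; only the linear readout is fit (ridge regression)."""
        W = rng.normal(size=(X.shape[1], n_hidden))
        b = rng.normal(size=n_hidden)
        H = np.tanh(X @ W + b)                               # random nonlinear features
        beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta

    # Fit a noisy 1-D regression problem while training only the readout weights.
    X = np.linspace(-3, 3, 400)[:, None]
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=400)
    W, b, beta = elm_fit(X, y)
    print(np.mean((elm_predict(X, W, b, beta) - y) ** 2))    # small training error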


6. Structured & fast random features

A bottleneck in early work was computing the projection Rx for a dense random matrix R. Solutions:

  • Fastfood transform (Le et al., 2013) — replace the dense Gaussian matrix with products of Hadamard, diagonal, and permutation matrices, cutting the projection cost to O(d log d) (see the sketch after this list).

  • Orthogonal random features — improve approximation quality for the same number of features.

  • Circulant and convolutional projections — leverage FFT for speed.
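A sketch in the spirit of these structured transforms — one sign-diagonal / fast-Hadamard / permutation block standing in for a dense Gaussian projection. A full Fastfood transform stacks several such blocks with an extra Gaussian scaling; everything below (sizes, the fwht helper) is illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def fwht(x):
        """Fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2."""
        x = x.copy()
        h, d = 1, len(x)
        while h < d:
            for i in range(0, d, 2 * h):
                a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
                x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
            h *= 2
        return x

    def structured_projection(x, signs, perm):
        """Random signs, then a fast Hadamard mix, then a permutation -- replacing a
        dense Gaussian matrix-vector product with an O(d log d) structured one."""
        return fwht(signs * x)[perm] / np.sqrt(len(x))

    d = 1024                                   # power of 2 (pad the input otherwise)
    x = rng.normal(size=d)
    signs = rng.choice([-1.0, 1.0], size=d)    # the diagonal matrix D
    perm = rng.permutation(d)                  # the permutation, stored as an index array
    print(structured_projection(x, signs, perm)[:5])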


7. Applications & uses

  • Kernel approximation — SVMs, Gaussian processes, spectral clustering.

  • Approximate nearest neighbor search — LSH-based features.

  • Dimensionality reduction — pre-processing step for large datasets.

  • Streaming / low-memory learning — fixed-size random feature maps allow online learning without storing large kernel matrices (see the sketch after this list).

  • Privacy — random projections can obfuscate individual coordinates while retaining distances.
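As a sketch of the streaming point above: online learning on a fixed-size random feature map, assuming scikit-learn (RBFSampler + SGDClassifier); the simulated stream and all sizes are illustrative:

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    d = 20

    # Fixed-size random feature map: memory does not grow with the number of samples seen.
    rff = RBFSampler(gamma=0.5, n_components=300, random_state=0)
    rff.fit(np.zeros((1, d)))                      # only the input dimension is needed
    clf = SGDClassifier()

    for step in range(200):                        # simulated stream of mini-batches
        X = rng.normal(size=(32, d))
        y = (X[:, 0] * X[:, 1] > 0).astype(int)    # a toy nonlinear target
        clf.partial_fit(rff.transform(X), y, classes=[0, 1])

    X_test = rng.normal(size=(1000, d))
    y_test = (X_test[:, 0] * X_test[:, 1] > 0).astype(int)
    print(clf.score(rff.transform(X_test), y_test))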


8. Useful conceptual connections

  • Compressed sensing — uses random projections to recover sparse signals; similar mathematics to random features but for reconstruction, not classification.

  • Johnson–Lindenstrauss lemma — the theoretical guarantee for distance-preserving random mappings.

  • Sketching in numerical linear algebra — random projections for matrix approximation; directly related to the same math.

  • Hash kernels in NLP — “feature hashing” maps high-dimensional sparse features into a lower-dimensional space via pseudo-random hash functions, similar in spirit to LSH as features.
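A small sketch of the hashing trick mentioned in the last bullet. It uses Python’s built-in hash purely for illustration (it is salted per process); real systems use a stable hash such as MurmurHash, often with a second hash for the sign:

    from collections import Counter

    def hash_features(tokens, n_buckets=1024):
        """Map a bag of tokens to a fixed-size vector via a pseudo-random hash.
        A sign hash keeps collisions unbiased in expectation."""
        vec = [0.0] * n_buckets
        for tok, count in Counter(tokens).items():
            idx = hash(tok) % n_buckets
            sign = 1.0 if hash(tok + "_sign") % 2 == 0 else -1.0
            vec[idx] += sign * count
        return vec

    doc = "random features map high dimensional sparse features down".split()
    v = hash_features(doc)
    print(sum(1 for x in v if x != 0.0), "non-zero buckets out of", len(v))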


