Backpropagation in higher-dimensional space
One of the most common misconceptions about backpropagation in higher-dimensional space is assuming that our low-dimensional intuitions about gradients, distances, and geometry still hold.
In reality, several counterintuitive effects kick in as dimensionality grows:
1. The “steepest descent” direction is almost orthogonal to what you’d expect
- In 2D or 3D, we imagine the gradient pointing "straight downhill" toward a minimum.
- In high dimensions, randomly chosen directions are nearly orthogonal, and most of the volume of a ball sits near its surface, so the gradient is almost always nearly perpendicular to any direction you might have expected it to point (see the sketch below).
- This means backprop is often not "going straight to the minimum" in a way we can picture; it is following a narrow path through an extremely thin shell of parameter space.
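To make the near-orthogonality effect concrete, here is a minimal NumPy sketch (the dimensions and sample counts are arbitrary choices): it samples pairs of random Gaussian vectors and shows that the average angle between them approaches 90° as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, n_pairs=100):
    """Average |cos(angle)| between pairs of random Gaussian vectors in R^dim."""
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.abs(cos).mean()

for dim in [2, 10, 100, 1_000, 10_000, 100_000]:
    print(f"dim={dim:>7}: mean |cosine| = {mean_abs_cosine(dim):.4f}")
# The mean |cosine| shrinks toward 0 (roughly like 1/sqrt(dim)): in high
# dimensions, a gradient direction is nearly perpendicular to almost any
# direction you picked in advance.
```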
2. Distances behave strangely
- In high dimensions, points that seem "close" in Euclidean distance can still be functionally very far apart in terms of the loss landscape.
- Almost all points are roughly the same distance from the origin (see the sketch below), so moving "a little bit" in many coordinates at once can produce a huge change in the loss.
- Backprop is sensitive to this: the optimizer tunes the update magnitude in each dimension, but our 3D mental model of "just take a step down the hill" fails.
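A quick way to see this distance concentration is to sample points from a standard Gaussian and compare the spread of their norms to the mean norm; a minimal sketch with arbitrary sample sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in [2, 10, 100, 1_000, 10_000]:
    # Sample points from a standard Gaussian in R^dim and look at their norms.
    x = rng.standard_normal((1000, dim))
    norms = np.linalg.norm(x, axis=1)
    print(f"dim={dim:>6}: mean distance from origin = {norms.mean():7.2f}, "
          f"relative spread = {norms.std() / norms.mean():.4f}")
# The mean distance grows like sqrt(dim) while the relative spread shrinks:
# almost every point sits at nearly the same distance from the origin,
# inside a thin shell.
```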
3. Vanishing / exploding gradients are mostly a geometry problem, not just a network depth problem
- People often blame depth alone, but in high-dimensional weight space, products of Jacobians can blow up or vanish much faster than intuition suggests.
- Even with fairly shallow nets, these high-dimensional effects compound, so the gradient norm can explode or shrink to almost nothing (see the sketch below).
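The compounding effect is easy to reproduce with a chain of random Jacobian-like matrices. This is a toy sketch with arbitrary width, depth, and scale values, not the Jacobians of any real network:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 256, 30

def backprop_norm(scale):
    """Norm of a unit vector pushed through `depth` random dim x dim Jacobians
    whose entries have standard deviation scale / sqrt(dim)."""
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(depth):
        J = rng.standard_normal((dim, dim)) * (scale / np.sqrt(dim))
        v = J @ v
    return np.linalg.norm(v)

for scale in [0.9, 1.0, 1.1]:
    print(f"scale={scale}: norm after {depth} Jacobians = {backprop_norm(scale):.3e}")
# Scaling just below 1 makes the norm shrink geometrically, just above 1 makes
# it explode; the effect compounds with every Jacobian in the chain.
```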
4. Saddles dominate over local minima
- In low dimensions, we might picture getting "stuck" in a local minimum.
- In high dimensions, local minima are rare; the problem is vast saddle regions where the gradient is tiny in most directions but large in a few (see the sketch below).
- Backprop spends a lot of time escaping these flat, high-dimensional saddles, something humans underestimate because our low-D mental picture doesn't show them as dominant.
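One way to build intuition for why saddles dominate is to model the Hessian at a generic critical point as a random symmetric matrix and check how often all of its eigenvalues are positive. This is only a heuristic sketch; random symmetric matrices are a crude stand-in for real network Hessians:

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_minima(dim, trials=2000):
    """Fraction of random symmetric 'Hessians' whose eigenvalues are all
    positive (i.e. that look like local minima rather than saddles)."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        H = (a + a.T) / 2          # random symmetric matrix as a stand-in Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):
            count += 1
    return count / trials

for dim in [1, 2, 3, 5, 8]:
    print(f"dim={dim}: fraction of 'minima' = {fraction_minima(dim):.3f}")
# Already by dim ~ 8, essentially every random critical point is a saddle:
# some directions curve up while others curve down.
```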
5. Correlation intuition breaks down
- Humans think "if two parameters both help the output, increasing both should help more."
- In high dimensions, parameters interact through the Jacobian and Hessian structure of the loss, so the effect of changing one parameter depends heavily on what the others are doing.
- The "helpful" direction is often a specific curved manifold in parameter space, and small misalignments make gradient updates inefficient (see the toy example below).
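Here is a toy two-parameter loss (a hypothetical quadratic with a cross term, chosen purely for illustration) where each parameter helps on its own, yet applying both individually-best moves at once is worse than either move alone:

```python
import numpy as np

# Toy quadratic loss with an interaction (cross) term between two parameters.
def loss(w1, w2, c=1.8):
    return w1**2 + w2**2 + c * w1 * w2 - w1 - w2

# Starting at (0, 0), find the best step along each coordinate separately.
s = np.linspace(0.0, 1.0, 1001)
best_w1 = s[np.argmin(loss(s, 0.0))]   # best move touching only w1
best_w2 = s[np.argmin(loss(0.0, s))]   # best move touching only w2

print(f"move w1 alone: loss = {loss(best_w1, 0.0):+.3f}")
print(f"move w2 alone: loss = {loss(0.0, best_w2):+.3f}")
print(f"move both:     loss = {loss(best_w1, best_w2):+.3f}")
# Each parameter 'helps' on its own, but the cross term makes the combined
# move much worse than either single move: interactions dominate in high-D.
```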
Thin manifold picture
Here’s how the “almost 90° gradient” effect ties into the thin manifold picture of deep network optimization:
1. Parameter space is huge, but the useful solutions live on a thin manifold
- A deep net with millions of weights lives in an unimaginably large space.
- Most of that space corresponds to completely random, useless networks (see the sketch below).
- The set of weights that actually performs well is concentrated on a low-dimensional, curved surface embedded in that huge space: the "solution manifold."
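To get a feel for how sparse the good region is, a crude sketch: sample random weights for a tiny 2-4-1 tanh network on XOR (a toy stand-in for a real network, with an arbitrary sampling scale) and count how many random weight settings come anywhere near solving it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny task: XOR with a 2-4-1 tanh network (17 parameters in total).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

def loss(params):
    W1 = params[:8].reshape(2, 4)
    b1 = params[8:12]
    W2 = params[12:16]
    b2 = params[16]
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    return np.mean((out - y) ** 2)

# Sample random points in weight space and see how many are any good.
losses = np.array([loss(rng.standard_normal(17) * 2.0) for _ in range(50_000)])
print(f"median loss of random nets: {np.median(losses):.3f}")
print(f"fraction with loss < 0.05:  {np.mean(losses < 0.05):.5f}")
# Almost no random weight settings solve even this tiny task; good weights
# occupy a vanishingly small region of the full parameter space.
```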
2. Gradients mostly move tangentially to this manifold
- Because random high-dimensional vectors are nearly orthogonal, the gradient update from backprop is almost always nearly perpendicular to your current position vector in the full space.
- This means updates don't "cut straight inward" toward some minimum; instead, they skim along the surface of a high-dimensional shell where the weights live (see the sketch below).
- You can picture it as walking along the surface of a balloon rather than falling toward the center.
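A small sketch of this "skimming the shell" effect, under the simplifying assumption that the update behaves like a random direction independent of the current weights (roughly true at initialization): the step is nearly perpendicular to the position vector, so the distance from the origin barely changes.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100_000                            # parameter count of a modest network

w = rng.standard_normal(dim)             # current weights ("position on the shell")
g = rng.standard_normal(dim)             # gradient-like direction, independent of w
g *= np.linalg.norm(w) / np.linalg.norm(g) * 0.01   # step of 1% of the weight norm

cos = w @ g / (np.linalg.norm(w) * np.linalg.norm(g))
print(f"cos(angle between step and position) = {cos:+.4f}")   # nearly 0

# A nearly-perpendicular step barely changes the distance from the origin:
# the weights skim along a thin shell instead of moving inward.
before, after = np.linalg.norm(w), np.linalg.norm(w + g)
print(f"|w| before = {before:.2f}, after the step = {after:.2f} "
      f"(relative change {abs(after - before) / before:.2e})")
```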
3. Loss changes slowly along most directions
- The loss surface is very flat in most directions, with only a few "active" directions where the curvature (the Hessian eigenvalues) is large (see the sketch below).
- This is why saddle points dominate: you're in a landscape with thousands of flat directions and a few steep ones.
- Optimizers like Adam or SGD with momentum effectively keep nudging you along the manifold's gentle slopes.
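A minimal sketch of such a spectrum, using least-squares on synthetic data that mostly lies near a 5-dimensional subspace (all sizes here are arbitrary): the Hessian of the least-squares loss has a handful of large eigenvalues and hundreds of tiny ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data that mostly lives near a low-dimensional subspace:
# 5 strong directions plus weak noise in the remaining 495.
n, dim, k = 1000, 500, 5
basis = rng.standard_normal((k, dim))
X = rng.standard_normal((n, k)) @ basis + 0.05 * rng.standard_normal((n, dim))

# For the loss L(w) = mean((X @ w - y)**2), the Hessian is 2 * X.T @ X / n.
H = 2.0 * X.T @ X / n
eigs = np.sort(np.linalg.eigvalsh(H))[::-1]

print("largest 5 eigenvalues:", np.round(eigs[:5], 2))
print(f"median eigenvalue    : {np.median(eigs):.5f}")
# A handful of directions have large curvature; the loss is nearly flat along
# the hundreds of others, which is the thin-manifold picture in miniature.
```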
4. Practical consequences
- Learning rates: Too large, and you'll skip off the manifold entirely; too small, and you'll creep along it forever (see the sketch below).
- Batch noise: Mini-batch stochasticity injects a random component into updates, which actually helps exploration of the manifold instead of getting stuck in narrow valleys.
- Generalization: Networks that find wide, flat manifolds of solutions tend to generalize better than ones that fall into sharp pits.
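A toy illustration of the learning-rate trade-off, using plain gradient descent on a quadratic with one steep and many flat directions (curvature values chosen arbitrarily to echo the spectrum sketched above):

```python
import numpy as np

# Quadratic toy loss 0.5 * sum(c * w**2) with 1 steep and 99 flat directions.
curvatures = np.array([100.0] + [0.1] * 99)

def run_gd(lr, steps=200):
    w = np.ones_like(curvatures)
    for _ in range(steps):
        grad = curvatures * w                # gradient of the quadratic loss
        w = w - lr * grad
    return 0.5 * np.sum(curvatures * w**2)   # final loss

for lr in [0.0001, 0.001, 0.019, 0.025]:
    print(f"lr={lr:<7}: final loss = {run_gd(lr):.3e}")
# Too small (1e-4): creeps along the flat directions and barely improves.
# Near the stability limit (0.019): the steep direction is solved, but progress
# along the flat directions is still slow.
# Too large (0.025): diverges along the steep direction, i.e. "skips off".
```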