Backpropagation in higher-dimensional space

One of the most common misconceptions about backpropagation in higher-dimensional space is the assumption that our low-dimensional intuitions about gradients, distances, and geometry still hold.

In reality, several counterintuitive effects kick in as dimensionality grows:



1. The “steepest descent” direction is almost orthogonal to what you’d expect

  • In 2D or 3D, we imagine gradients pointing “straight downhill” toward a minimum.

  • In high dimensions, most of a ball’s volume is concentrated near its surface, and two independently chosen directions are almost always nearly orthogonal, so the gradient is typically close to perpendicular to whatever particular displacement you expected it to follow.

  • This means backprop is often not “going straight to the minimum” in a way we can picture—it’s following a narrow ridge through an extremely thin shell of parameter space.
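
A quick numerical sketch of this near-orthogonality effect, using plain NumPy with dimensions and sample counts picked arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim, trials=1000):
    """Average |cosine| of the angle between two independent random directions."""
    a = rng.standard_normal((trials, dim))
    b = rng.standard_normal((trials, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))

for dim in (2, 10, 100, 10_000):
    print(f"dim={dim:>6}: mean |cos angle| = {mean_abs_cosine(dim):.4f}")

# The mean |cosine| falls roughly like 1/sqrt(dim): in 2D two random
# directions are typically 40-60 degrees apart, while in 10,000
# dimensions they are almost exactly orthogonal.
```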


2. Distances behave strangely

  • In high dimensions, points that seem “close” in Euclidean distance can still be functionally very far apart in terms of the loss landscape.

  • Almost all randomly chosen points sit at roughly the same distance from the origin, and a step that looks tiny in any single coordinate still has a large overall norm when taken across millions of coordinates, so it can produce a huge change in the loss.

  • Backprop is sensitive to this: the update magnitudes in each dimension are tuned by the optimizer, but our 3D mental model of “just take a step down the hill” fails.
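
A rough sketch of this concentration of distances (again with arbitrary dimensions and sample counts, using i.i.d. standard normal coordinates):

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 100, 10_000):
    points = rng.standard_normal((1_000, dim))      # random points in dim-dimensional space
    norms = np.linalg.norm(points, axis=1)          # distances from the origin
    print(f"dim={dim:>6}: mean distance = {norms.mean():7.2f}, "
          f"relative spread (std/mean) = {norms.std() / norms.mean():.4f}")

# The mean distance grows like sqrt(dim) while the relative spread shrinks
# like 1/sqrt(2*dim): almost every random point ends up in a thin shell at
# roughly the same distance from the origin.
```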


3. Vanishing / exploding gradients are mostly a geometry problem, not just a network depth problem

  • People often blame depth alone, but in high-dimensional weight space the product of layer Jacobians can blow up or vanish much faster than intuition suggests.

  • Even with relatively shallow nets, these high-dimensional effects accumulate, so the gradient norm can concentrate in a few directions or shrink to almost nothing.
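
A toy illustration of how products of layer Jacobians scale. Random Gaussian matrices stand in for real layer Jacobians here, and the width, depth, and scale factors are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 50                     # hypothetical layer width and depth

for scale in (0.9, 1.0, 1.1):              # per-layer gain relative to the "critical" value
    v = rng.standard_normal(width)
    v /= np.linalg.norm(v)                 # unit gradient vector at the output
    for _ in range(depth):
        J = rng.standard_normal((width, width)) * (scale / np.sqrt(width))
        v = J @ v                          # backprop multiplies by one Jacobian per layer
    print(f"scale={scale}: gradient norm after {depth} layers = {np.linalg.norm(v):.3e}")

# The norm behaves roughly like scale**depth: a 10% per-layer mis-scaling
# compounds into gradients that are around two orders of magnitude too
# small or too large after 50 layers.
```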


4. Saddles dominate over local minima

  • In low dimensions, we might picture getting “stuck” in a local minimum.

  • In high dimensions, bad local minima are rare; the real problem is vast saddle regions where the gradient is tiny in most directions but large in a few.

  • Backprop spends a lot of time escaping these flat, high-dimensional saddles—something humans underestimate because our low-D mental picture doesn’t show them as dominant.
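
One way to see why saddles dominate: if the curvature (Hessian) at a random critical point behaved like a random symmetric matrix, a common heuristic rather than a statement about any particular network, then the chance that every eigenvalue is positive (a true local minimum) collapses quickly with dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_that_are_minima(dim, trials=2000):
    """Fraction of random symmetric matrices whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((dim, dim))
        H = (A + A.T) / 2                        # symmetrize, like a Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):    # minimum iff every curvature is positive
            count += 1
    return count / trials

for dim in (1, 2, 5, 10):
    print(f"dim={dim:>2}: fraction that are minima = {fraction_that_are_minima(dim):.3f}")

# By ~10 dimensions essentially every random critical point of this kind is a
# saddle; in million-dimensional weight spaces the imbalance is overwhelming.
```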


5. Correlation intuition breaks down

  • Humans think “if two parameters both help the output, increasing both should help more.”

  • In high dimensions, parameters interact through the network’s Jacobians and the loss’s Hessian: the cross-curvature terms mean the effect of changing one weight depends on what all the others are doing.

  • The set of genuinely helpful updates is often a narrow, curved region of parameter space; even small misalignments with it make gradient updates inefficient.
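
A deliberately tiny caricature of this broken intuition: two parameters that both help when nudged individually, but hurt when pushed together, because the loss only cares about their combination (the numbers are made up to make the effect obvious):

```python
def loss(w0, w1):
    # The loss only cares about the sum of the two parameters.
    return (w0 + w1 - 1.0) ** 2

w0, w1, step = 0.0, 0.0, 1.2

print("start:           ", loss(w0, w1))                  # 1.00
print("increase w0 only:", loss(w0 + step, w1))           # 0.04  (big improvement)
print("increase w1 only:", loss(w0, w1 + step))           # 0.04  (big improvement)
print("increase both:   ", loss(w0 + step, w1 + step))    # 1.96  (worse than the start)
```

With millions of parameters, interactions like this are everywhere, which is why moving with the grain of the curvature matters more than moving each coordinate in its individually helpful direction.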

Thin manifold picture

Here’s how the “almost 90° gradient” effect ties into the thin manifold picture of deep network optimization:


1. Parameter space is huge, but the useful solutions live on a thin manifold

  • A deep net with millions of weights lives in an unimaginably large space.

  • Most of that space corresponds to completely random, useless networks.

  • The set of weights that actually perform well is concentrated in a low-dimensional, curved surface embedded in that huge space — the “solution manifold.”


2. Gradients mostly move tangentially to this manifold

  • Because random high-D vectors are nearly orthogonal, the gradient update from backprop is almost always close to perpendicular to your current position vector in the full space.

  • This means updates don’t “cut straight inward” toward some minimum — instead, they skim along the surface of a high-dimensional shell where the weights live.

  • You can picture it as walking along the surface of a balloon rather than falling toward the center.
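
A small sketch of this “skimming the shell” picture: start on the unit sphere, take a random step of fixed length, and measure how much your distance from the centre actually changes (the step size, dimensions, and trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_radius_change(dim, step_size=0.1, trials=200):
    """Average |change in distance from the origin| after a random step."""
    changes = []
    for _ in range(trials):
        w = rng.standard_normal(dim)
        w /= np.linalg.norm(w)               # a point on the unit sphere
        d = rng.standard_normal(dim)
        d /= np.linalg.norm(d)               # a random step direction
        changes.append(abs(np.linalg.norm(w + step_size * d) - 1.0))
    return float(np.mean(changes))

for dim in (3, 1_000, 100_000):
    print(f"dim={dim:>7}: mean |radius change| = {mean_radius_change(dim):.5f}")

# In 3D a step of length 0.1 has a sizeable radial component (~0.05 on average).
# In high dimensions the radial component shrinks like 1/sqrt(dim); what remains
# is only the tiny second-order increase (~step**2 / 2) that any tangential step
# causes, i.e. the step slides along the shell instead of cutting inward.
```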


3. Loss changes slowly along most directions

  • The loss surface is very flat in most directions, with only a few “active” directions where the curvature (Hessian eigenvalues) is large.

  • This is why saddle points dominate — you’re in a landscape with thousands of flat directions and a few steep ones.

  • Optimizers like Adam or momentum SGD effectively keep nudging you along the manifold’s gentle slopes.
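
A hedged illustration of this kind of spectrum, using an over-parameterised least-squares loss (far more parameters than data points) because its Hessian can be written down exactly; it is a stand-in for a deep-net Hessian, not a measurement of one:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_params = 20, 500              # far more parameters than data
X = rng.standard_normal((n_samples, n_params))

# For the squared loss L(w) = ||X w - y||^2 / (2 n), the Hessian is
# H = X^T X / n, independent of w and y.
H = X.T @ X / n_samples
eigvals = np.sort(np.linalg.eigvalsh(H))[::-1]

print("largest 5 eigenvalues:", np.round(eigvals[:5], 2))
print("eigenvalues > 1e-8   :", int(np.sum(eigvals > 1e-8)))
print("total directions     :", n_params)

# At most 20 of the 500 directions have any curvature at all; the rest are
# exactly flat. Measured deep-net Hessian spectra show the same shape: a
# handful of large eigenvalues on top of a huge near-zero bulk.
```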


4. Practical consequences

  • Learning rates: Too large, and you’ll skip off the manifold entirely; too small, and you’ll creep along it forever.

  • Batch noise: Mini-batch stochasticity injects a random component into each update, which actually helps you explore the manifold rather than getting stuck in narrow valleys.

  • Generalization: Networks that find wide, flat manifolds of solutions tend to generalize better than ones that fall into sharp pits.
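
The learning-rate trade-off in the first bullet can be seen on a single stiff quadratic direction with plain gradient descent (the curvature and rates below are arbitrary):

```python
curvature = 10.0                     # Hessian eigenvalue of the stiff direction

def grad(w):
    # Gradient of L(w) = 0.5 * curvature * w**2 along this one direction.
    return curvature * w

for lr in (0.25, 0.001, 0.15):       # too large, too small, roughly right
    w = 1.0
    for _ in range(50):
        w -= lr * grad(w)
    print(f"lr={lr:<6}: |w| after 50 steps = {abs(w):.3e}")

# The stability limit is lr = 2 / curvature = 0.2: above it the iterates
# blow up ("skip off"), far below it they barely move ("creep"), and just
# under it they converge quickly.
```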
