ReLU Networks as Switch-Based Linear Systems During SGD Training

Introduction

When we think about the ReLU activation function, the common viewpoint is geometric: it projects its input onto the non-negative orthant, zeroing out every negative component.

But let’s adopt a more “circuit-like” perspective: treat ReLU as a switch. In the forward pass, each ReLU is either “on” (passing the input through linearly) or “off” (outputting zero).
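
A minimal sketch of this switch view in NumPy (the example values are purely illustrative): the ReLU output is a 0/1 gate multiplied elementwise by the pre-activation.

  import numpy as np

  z = np.array([-1.3, 0.2, 2.5])        # pre-activations of three hidden units
  gate = (z > 0).astype(z.dtype)        # the "switch": 1 if on, 0 if off
  relu_out = gate * z                   # identical to np.maximum(z, 0)
  assert np.allclose(relu_out, np.maximum(z, 0))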


Forward Pass: Switching Decisions Become Fixed
Once you fix an input sample, the sign pattern at each hidden unit is determined. The state of each ReLU — on or off — is completely known for that forward pass.
At this point, the network reduces to a purely linear mapping:

  • Only the active ReLUs propagate signals.

  • The inactive ones are like open circuits — they disconnect that part of the graph.

For a given input, you can literally write the network as:

y = W_L^{(a)} W_{L-1}^{(a)} \dots W_1^{(a)} x

where the superscript (a) means “mask out the rows or columns corresponding to inactive ReLUs.”
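
As a small sanity check, here is a NumPy sketch of that collapse (the layer sizes, the two hidden layers, and the linear output layer are illustrative assumptions). Writing each mask as a diagonal 0/1 matrix D_l, the masked product becomes W_3 D_2 W_2 D_1 W_1 x, and it reproduces the ordinary forward pass exactly:

  import numpy as np

  rng = np.random.default_rng(0)
  W1 = rng.normal(size=(8, 4))          # first hidden layer
  W2 = rng.normal(size=(8, 8))          # second hidden layer
  W3 = rng.normal(size=(1, 8))          # linear output layer
  x = rng.normal(size=4)

  # Ordinary forward pass, recording each ReLU layer's on/off pattern.
  h1 = W1 @ x;  d1 = (h1 > 0).astype(float);  a1 = d1 * h1
  h2 = W2 @ a1; d2 = (h2 > 0).astype(float);  a2 = d2 * h2
  y = W3 @ a2

  # The same output as a single linear map with the switch states frozen.
  y_linear = W3 @ np.diag(d2) @ W2 @ np.diag(d1) @ W1 @ x
  assert np.allclose(y, y_linear)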


Backpropagation: Linear System Under Known Switch States
Now comes the key point: during backpropagation, with the switching pattern fixed, you are updating the weights of a completely linear system.
SGD is not “fighting” the nonlinearity — it’s adjusting a set of chained linear maps with known connectivity for this sample.
This means:

  • Gradients flow only through the active subgraph.

  • The weight updates occur along specific active paths from the output back to the input layer (a small sketch follows this list).
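
A self-contained sketch of this, assuming a one-hidden-layer network and a squared-error loss (both illustrative choices): the gradient reaching the hidden pre-activations is gated by the same 0/1 switch pattern, so the gradient rows belonging to “off” units are exactly zero.

  import numpy as np

  rng = np.random.default_rng(0)
  W1 = rng.normal(size=(8, 4))
  W2 = rng.normal(size=(1, 8))
  x, target = rng.normal(size=4), 1.0

  h1 = W1 @ x
  d1 = (h1 > 0).astype(float)           # switch states, fixed by the forward pass
  a1 = d1 * h1
  y = W2 @ a1

  # Manual backprop for the squared-error loss (y - target)^2.
  g_y = 2 * (y - target)
  g_a1 = W2.T @ g_y
  g_h1 = d1 * g_a1                      # the switch gates the gradient as well
  grad_W1 = np.outer(g_h1, x)
  assert np.allclose(grad_W1[d1 == 0], 0.0)   # off units receive no update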


Low-Magnitude Initialization: Pathway Reinforcement
Suppose you initialize the weights with low magnitude. Early on:

  • The ReLU on/off decisions are essentially arbitrary, driven by small random deviations in the pre-activations.

  • Once a particular pattern happens for a given input, gradients only travel through that pattern.

  • Because the network is linear within that pattern, SGD will strengthen the magnitudes along those active edges.

This has a pathway reinforcement effect:

  • Larger weights along a path increase pre-activations in the same direction for similar inputs in the future.

  • This makes it more likely that the same ReLU states will occur, locking in the same switching pattern.

In other words, the network “commits” to certain active subgraphs early, and those subgraphs tend to persist.
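
A toy experiment along these lines, assuming a single fixed training input, low-magnitude initialization, and plain SGD on a squared error (all illustrative choices): we record the on/off pattern at every step. With only this one input, a unit that switches off stops receiving gradient and can never switch back on, so changes to the pattern are one-way (units can only drop out) and it tends to lock in early.

  import numpy as np

  rng = np.random.default_rng(1)
  W1 = 0.01 * rng.normal(size=(16, 4))  # low-magnitude initialization
  W2 = 0.01 * rng.normal(size=(1, 16))
  x, target, lr = rng.normal(size=4), 1.0, 0.1

  patterns = []
  for step in range(200):
      h = W1 @ x
      d = (h > 0).astype(float)         # switch pattern at this step
      a = d * h
      y = W2 @ a
      patterns.append(tuple(d))

      # Plain SGD on (y - target)^2, with gradients gated by the frozen pattern d.
      g_y = 2 * (y - target)
      g_W2 = np.outer(g_y, a)
      g_W1 = np.outer(d * (W2.T @ g_y), x)
      W2 -= lr * g_W2
      W1 -= lr * g_W1

  flips = sum(p != q for p, q in zip(patterns, patterns[1:]))
  print("steps where the switch pattern changed:", flips)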


Implications and Hypotheses

  1. Early phase = structure formation
    Training begins by picking and strengthening certain subgraphs based on small random fluctuations and the input distribution.

  2. Later phase = fine-tuning linear maps
    Once switching patterns stabilize, the network is effectively training a separate linear system for each "region" of the input space (see the sketch after this list).

  3. Potential brittleness
    If early pathway choices are suboptimal, they could be reinforced and become harder to change, especially with small learning rates and without mechanisms like dropout or a large initialization variance. That said, this scenario may be unlikely, since the loss landscapes of large neural networks are thought to be dominated by saddle points rather than by bad local minima that trap training.

  4. Analogy to decision trees
    Like early splits in a tree that define all downstream decisions, early ReLU states shape the learning trajectory.
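
To make point 2 concrete, a tiny sketch (one hidden layer with random weights, purely for illustration): each input selects an activation region, and within that region the network applies exactly one linear map; two unrelated inputs usually land in different regions and therefore see different matrices.

  import numpy as np

  rng = np.random.default_rng(2)
  W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))

  def effective_matrix(x):
      # The linear map the network applies inside x's activation region.
      d = (W1 @ x > 0).astype(float)
      return W2 @ np.diag(d) @ W1

  x_a, x_b = rng.normal(size=4), rng.normal(size=4)
  A, B = effective_matrix(x_a), effective_matrix(x_b)
  print("same linear map for both inputs:", np.allclose(A, B))  # usually False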


Closing Thought
From this perspective, ReLU networks under SGD are not mysterious nonlinear beasts all the time — for each input, they’re piecewise-linear devices whose “wiring diagram” is determined in the forward pass.
SGD then adjusts the linear weights along those wires, often reinforcing the very routing decisions that created those wires in the first place.
This viewpoint could help us better understand phenomena like dead ReLUs, feature map specialization, and training instability in deep sparse-activation regimes.
