Fast Walsh Hadamard Transform (out-of-place, fixed A-stride)

August 14, 2025

Fast WHT Algorithm (out-of-place, fixed A-stride)

Fast Walsh Hadamard transform (out-of-place):

Let $n=2^m$ . For each stage $t = 1, \dots, m:$

Read $A$ in adjacent pairs: $(A[0],A[1]),(A[2],A[3]),\ldots,(A[n-2],A[n-1])$ .
Write sums sequentially into $B[0],B[1],\ldots,B[n/2-1]$ .
Write diffs sequentially into $B[n/2],B[n/2+1],\ldots,B[n-1]$ .
Swap $A\leftrightarrow B$ .

Equivalently, the read stride is always 1; the write pointers are two linear cursors: wSum starting at 0 and wDiff starting at $n/2$ .

Practical notes & perks

Cache/stream-friendly: Writing all sums contiguously and all diffs contiguously improves write combining and sequential bandwidth versus interleaving.
Vectorization: The inner loop is a pure stride-1 read with two monotone write cursors — perfect for SIMD and wide loads/stores.
2*memory: Uses twice the memory of the in-place algorithm which will either increase cache memory usage or slow the algorithm down for wide inputs.
Mixed strategy: Optimal algorithms for the WHT use mixed strategies to obtain maximum compute speed by combining naive, out-of-place and in-place operations.

Comments