Is the Fast Walsh Hadamard transform (FWHT) a poor fit for current GPUs?
Why plain FWHT often performs poorly on GPUs
- Very low arithmetic intensity in the naïve form. The standard FWHT is a sequence of log₂N butterfly stages in which each element takes part in only an add and a subtract per stage. That is very few FLOPs per memory access, so kernels tend to be memory-bound on modern GPUs rather than compute-bound. This is the main reason the straightforward implementation doesn't saturate GPU compute; a naïve kernel sketch follows this list. (ResearchGate, Estudo Geral)
- Strided / non-coalesced memory access across stages. Each stage uses a different stride. Implemented directly in global memory this causes non-coalesced loads/stores, or many bank conflicts when using shared memory, hurting throughput. Papers implementing the FWHT on GPUs spend most of their effort reorganizing data to avoid these costs. (Estudo Geral, eprints.ugd.edu.mk)
- Frequent synchronization and small kernels. The log-stage structure forces synchronization between stages (or multiple kernel launches). That creates overhead when the per-stage work is small (small N or few batched transforms); GPUs prefer large, dense operations that keep all SMs busy. (ResearchGate)
- Lack of a vendor-tuned library primitive. Unlike the FFT (cuFFT), there is no widely available, heavily tuned vendor FWHT in the major GPU libraries. Users therefore reimplement it (often with suboptimal kernels) or rely on community code. Recent papers explicitly note that cuFFT has no real-to-real DHT/FWHT support. (arXiv)
- Hardware trends: tensor cores and large matrix throughput. Modern GPUs deliver huge throughput for matrix-multiply style ops (Tensor Cores, systolic units). The FWHT's simple add/sub pattern doesn't map directly onto tensor cores unless you reformulate the algorithm as blocked matrix multiplies, which recent work does. This mismatch explains why a naïve FWHT underperforms relative to what the chip can deliver through matmuls. (arXiv)
What researchers and implementers have done (short survey)
- Shared-memory, bank-aware tiling: early GPU FWHT implementations use shared-memory tiles and carefully arrange data to minimize bank conflicts, achieving decent speedups over naïve global-memory kernels. These are useful but still limited by memory traffic. (Estudo Geral, eprints.ugd.edu.mk)
- Batched transforms: batching many small transforms together amortizes kernel-launch and synchronization overhead and is a practical way to reach good utilization. (ResearchGate)
- Reformulation to use Tensor Cores / matmul primitives: very recent work (HadaCore, Dec 2024) shows that transforming the FWHT into a blocked, tensor-core-friendly kernel yields substantial speedups on A100/H100 compared to classical GPU FWHT kernels, demonstrating that with the right mapping the FWHT can be fast on modern GPUs. A short derivation of the identity behind this mapping follows this list. (arXiv)
- Domain-specific hardware or in-memory approaches: for extreme cases (very high throughput, streaming), people investigate PIM / memristor or FPGA implementations of the DHT/FWHT, where the memory-bound nature is handled differently. (NSF Pubs)
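The reason the matmul reformulation is possible at all is the Kronecker (Sylvester) structure of the Walsh–Hadamard matrix. The sketch below shows the general identity, not HadaCore's specific blocking (read the paper for that):

```latex
% Sylvester construction: the order-2^n Walsh-Hadamard matrix factors as a
% Kronecker product of smaller Hadamard matrices,
%   H_{2^n} = H_{2^k} \otimes H_{2^{n-k}}, \qquad 0 < k < n.
% Combined with (A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top}),
% reshaping the length-2^n input x into a 2^{n-k} \times 2^k matrix X
% (column-major) turns the transform into two dense matrix products:
\[
  H_{2^n}\,x
    = (H_{2^k} \otimes H_{2^{n-k}})\,\mathrm{vec}(X)
    = \mathrm{vec}\!\left(H_{2^{n-k}}\, X\, H_{2^k}^{\top}\right).
\]
% Each factor is a GEMM against a small \pm 1 Hadamard matrix (e.g. 16x16),
% which matches tensor-core tile shapes; applying the factorization
% recursively keeps every factor small.
```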
Practical takeaway — when the claim “FWHT is a poor fit for GPUs” is fair
- The claim is fair if you mean "the straightforward, stage-by-stage FWHT implemented naively is a poor fit for current GPUs". For small N, non-batched workloads, or code that does direct global-memory butterflies, the GPU will be underutilized. (Estudo Geral)
- The claim is overstated if you allow algorithmic rewrites: with tiling, batching, careful shared-memory use, warp shuffles, or by rephrasing parts of the transform as matrix multiplications (so they can run on Tensor Cores), you can get performance that uses the GPU well, as recent work demonstrates. In other words, this is not a hardware limitation so much as a mapping/data-layout problem: solve the mapping and GPUs perform. (arXiv)
Recommendations (if you need to implement FWHT on GPUs)
- If you can, batch many transforms together so each kernel has enough work to amortize overhead. Batching is often the single most important trick; see the batched shared-memory sketch after this list. (ResearchGate)
- Tile the computation: load tiles into shared memory, perform multiple butterfly stages there (reducing global traffic), then write back. Pay attention to bank conflicts. (Estudo Geral)
- For NVIDIA: consider warp-shuffle intrinsics to exchange butterfly partners inside a warp, avoiding shared memory and explicit synchronization for small transforms; a warp-shuffle sketch also follows this list.
- For maximal throughput on recent hardware, reformulate the transform into block-matrix operations so you can use Tensor Cores or cuBLAS-style matmul kernels; see HadaCore for a canonical example. This often gives the best absolute throughput on A100/H100. A small cuBLAS sketch of the idea follows this list as well. (arXiv)
- If the per-transform work is tiny (very small N) or your transforms are highly irregular, consider CPU (vectorized) or FPGA/PIM alternatives; they sometimes win thanks to lower overhead and better support for small or irregular workloads. (NSF Pubs)
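A minimal sketch of the batching and tiling recommendations combined (my own illustration under simplifying assumptions: one transform per thread block, power-of-two length N ≤ 1024 so a whole transform fits in shared memory):

```cuda
// Batched, shared-memory FWHT: grid.x = number of transforms in the batch,
// each block runs all log2(N) stages out of shared memory, so global memory
// is read and written exactly once per element.
#include <cuda_runtime.h>

template <int N>   // transform length: power of two, N <= 1024
__global__ void fwht_batched_shared(float* data /* [batch][N] */) {
    __shared__ float s[N];
    float* x = data + (size_t)blockIdx.x * N;   // this block's transform

    // Coalesced load into shared memory.
    for (int i = threadIdx.x; i < N; i += blockDim.x) s[i] = x[i];
    __syncthreads();

    // All butterfly stages run in shared memory; only block-level syncs needed.
    for (int h = 1; h < N; h <<= 1) {
        for (int t = threadIdx.x; t < N / 2; t += blockDim.x) {
            int i = (t / h) * 2 * h + (t % h);
            float a = s[i], b = s[i + h];
            s[i]     = a + b;
            s[i + h] = a - b;
        }
        __syncthreads();
    }

    // Coalesced write-back.
    for (int i = threadIdx.x; i < N; i += blockDim.x) x[i] = s[i];
}

// Example launch for a batch of length-256 transforms:
//   fwht_batched_shared<256><<<batch, 128>>>(d_data);
```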
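A sketch of the warp-shuffle idea for very small transforms (assumptions: one length-32 transform per warp, one element per lane). __shfl_xor_sync exchanges values between lanes whose IDs differ in one bit, which is exactly the butterfly-partner pattern, so no shared memory or __syncthreads is needed:

```cuda
// Length-32 FWHT entirely in registers: one transform per warp, one element
// per lane, partner exchange via warp shuffles.
#include <cuda_runtime.h>

__global__ void fwht32_warp(float* data /* [batch][32] */, int batch) {
    int lane = threadIdx.x & 31;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // one warp per transform
    if (warp >= batch) return;   // whole warp exits together, no divergence

    float v = data[warp * 32 + lane];
    for (int h = 1; h < 32; h <<= 1) {
        // Fetch the butterfly partner's value from lane (lane ^ h).
        float partner = __shfl_xor_sync(0xffffffffu, v, h);
        // The lower lane of each pair keeps the sum, the upper lane the difference.
        v = (lane & h) ? (partner - v) : (v + partner);
    }
    data[warp * 32 + lane] = v;
}
```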
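And a sketch of how one Kronecker factor of the transform becomes a single GEMM (again my own illustration, not HadaCore's kernel): applying I ⊗ H16, i.e. multiplying every contiguous group of 16 elements of each vector by the 16×16 Hadamard matrix, via one cuBLAS call. With TF32 enabled (or FP16 inputs) cuBLAS can route such GEMMs to tensor cores; the complementary Kronecker factor needs a reshape/transpose between calls and is omitted here.

```cuda
// One Kronecker factor (I ⊗ H16) of a length-N FWHT as a single cuBLAS GEMM.
// d_h16 must hold the 16x16 Hadamard matrix (entries ±1, symmetric), d_x the
// batch*N input floats with N divisible by 16; results go to d_y.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void apply_h16_factor(cublasHandle_t handle,
                      const float* d_h16, const float* d_x, float* d_y,
                      int batch, int N) {
    const float one = 1.0f, zero = 0.0f;
    int M = batch * (N / 16);   // number of contiguous length-16 groups

    // Row-major view: X is an (M x 16) matrix and we want Y = X * H16
    // (H16 is symmetric). In cuBLAS column-major terms that is Y^T = H16 * X^T,
    // where X^T and Y^T are 16 x M matrices with leading dimension 16.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);  // opt in to TF32 tensor cores (cuBLAS 11+)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                16, M, 16,
                &one,  d_h16, 16,
                       d_x,   16,
                &zero, d_y,   16);
}
```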
Concrete references (read next)
- HadaCore: Tensor Core Accelerated Hadamard Transform Kernel (Dec 2024): shows how to get high throughput by remapping the FWHT onto tensor cores. (arXiv)
- The 2014–2015 line of work on optimized FWHT on GPUs: shared memory, bank-conflict avoidance, and tiling strategies (useful background and techniques). (Estudo Geral, ResearchGate)
- A recent arXiv (Aug 2025) CountSketch paper noting that the FWHT isn't in cuFFT and describing practical implementations for sketches; useful for applied-ML contexts. (arXiv)
- Older surveys/papers on GPU spectral transforms that discuss the memory-bound nature and possible reformulations. (ResearchGate)