Is the Fast Walsh Hadamard transform (FWHT) a poor fit for current GPUs?
Why plain FWHT often performs poorly on GPUs
- Very low arithmetic intensity in the naïve form. The standard FWHT is a sequence of log₂N butterfly stages in which each element takes part in only an add and a subtract per stage. That is very few FLOPs per memory access, so kernels tend to be memory-bound on modern GPUs rather than compute-bound. This is the main reason the straightforward implementation doesn't saturate GPU compute; a naïve kernel sketch follows this list. (ResearchGate, Estudo Geral)
- Strided / non-coalesced memory access across stages. Each stage uses a different stride. Implemented directly in global memory this causes non-coalesced loads/stores, or many bank conflicts when using shared memory, hurting throughput. Papers implementing the FWHT on GPUs spend most of their effort reorganizing data to avoid these costs. (Estudo Geral, eprints.ugd.edu.mk)
- Frequent synchronization and small kernels. The log-stage structure forces synchronization between stages (or multiple kernel launches). That creates overhead when the per-stage work is small (small N or few batched transforms); GPUs prefer large, dense operations that keep all SMs busy. (ResearchGate)
- Lack of a vendor-tuned library primitive. Unlike the FFT (cuFFT), there is no widely available, heavily tuned vendor FWHT in the major GPU libraries. Users therefore reimplement it (often with suboptimal kernels) or rely on community code. Recent papers explicitly note that cuFFT has no real-to-real DHT/FWHT support. (arXiv)
- Hardware trends: tensor cores and large matrix throughput. Modern GPUs deliver huge throughput for matrix-multiply style ops (Tensor Cores, systolic units). The FWHT's simple add/sub pattern doesn't map directly onto tensor cores unless you reformulate the algorithm as blocked matrix multiplies, which recent work does. This mismatch explains why a naïve FWHT underperforms relative to what the chip can deliver through matmuls. (arXiv)
What researchers and implementers have done (short survey)
- Shared-memory, bank-aware tiling: early GPU FWHT implementations use shared-memory tiles and carefully arrange data to minimize bank conflicts, achieving decent speedups over naïve global-memory kernels. These are useful but still limited by memory traffic. (Estudo Geral, eprints.ugd.edu.mk)
- Batched transforms: batching many small transforms together amortizes kernel-launch and synchronization overhead and is a practical way to reach good utilization. (ResearchGate)
- Reformulation to use Tensor Cores / matmul primitives: very recent work (HadaCore, Dec 2024) shows that transforming the FWHT into a blocked, tensor-core-friendly kernel yields substantial speedups on A100/H100 compared to classical GPU FWHT kernels, demonstrating that with the right mapping the FWHT can be fast on modern GPUs. A short derivation of the identity behind this mapping follows this list. (arXiv)
- Domain-specific hardware or in-memory approaches: for extreme cases (very high throughput, streaming), people investigate PIM / memristor or FPGA implementations of the DHT/FWHT, where the memory-bound nature is handled differently. (NSF Pubs)
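The reason the matmul reformulation is possible at all is the Kronecker (Sylvester) structure of the Walsh–Hadamard matrix. The sketch below shows the general identity, not HadaCore's specific blocking (read the paper for that):

```latex
% Sylvester construction: the order-2^n Walsh-Hadamard matrix factors as a
% Kronecker product of smaller Hadamard matrices,
%   H_{2^n} = H_{2^k} \otimes H_{2^{n-k}}, \qquad 0 < k < n.
% Combined with (A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top}),
% reshaping the length-2^n input x into a 2^{n-k} \times 2^k matrix X
% (column-major) turns the transform into two dense matrix products:
\[
  H_{2^n}\,x
    = (H_{2^k} \otimes H_{2^{n-k}})\,\mathrm{vec}(X)
    = \mathrm{vec}\!\left(H_{2^{n-k}}\, X\, H_{2^k}^{\top}\right).
\]
% Each factor is a GEMM against a small \pm 1 Hadamard matrix (e.g. 16x16),
% which matches tensor-core tile shapes; applying the factorization
% recursively keeps every factor small.
```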
Practical takeaway — when the claim “FWHT is a poor fit for GPUs” is fair
- The claim is fair if you mean "the straightforward, stage-by-stage FWHT implemented naively is a poor fit for current GPUs". For small N, non-batched workloads, or code that does direct global-memory butterflies, the GPU will be underutilized. (Estudo Geral)
- The claim is overstated if you allow algorithmic rewrites: with tiling, batching, careful shared-memory use, warp shuffles, or by rephrasing parts of the transform as matrix multiplications (so they can run on Tensor Cores), you can get performance that uses the GPU well, as recent work demonstrates. In other words, this is not a hardware limitation so much as a mapping/data-layout problem: solve the mapping and GPUs perform. (arXiv)
Recommendations (if you need to implement FWHT on GPUs)
- If you can, batch many transforms together so each kernel has enough work to amortize overhead. Batching is often the single most important trick; see the batched shared-memory sketch after this list. (ResearchGate)
- Tile the computation: load tiles into shared memory, perform multiple butterfly stages there (reducing global traffic), then write back. Pay attention to bank conflicts. (Estudo Geral)
- For NVIDIA: consider warp-shuffle intrinsics to exchange butterfly partners inside a warp, avoiding shared memory and explicit synchronization for small transforms; a warp-shuffle sketch also follows this list.
- For maximal throughput on recent hardware, reformulate the transform into block-matrix operations so you can use Tensor Cores or cuBLAS-style matmul kernels; see HadaCore for a canonical example. This often gives the best absolute throughput on A100/H100. A small cuBLAS sketch of the idea follows this list as well. (arXiv)
- If the per-transform work is tiny (very small N) or your transforms are highly irregular, consider CPU (vectorized) or FPGA/PIM alternatives; they sometimes win thanks to lower overhead and better support for small or irregular workloads. (NSF Pubs)
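A minimal sketch of the batching and tiling recommendations combined (my own illustration under simplifying assumptions: one transform per thread block, power-of-two length N ≤ 1024 so a whole transform fits in shared memory):

```cuda
// Batched, shared-memory FWHT: grid.x = number of transforms in the batch,
// each block runs all log2(N) stages out of shared memory, so global memory
// is read and written exactly once per element.
#include <cuda_runtime.h>

template <int N>   // transform length: power of two, N <= 1024
__global__ void fwht_batched_shared(float* data /* [batch][N] */) {
    __shared__ float s[N];
    float* x = data + (size_t)blockIdx.x * N;   // this block's transform

    // Coalesced load into shared memory.
    for (int i = threadIdx.x; i < N; i += blockDim.x) s[i] = x[i];
    __syncthreads();

    // All butterfly stages run in shared memory; only block-level syncs needed.
    for (int h = 1; h < N; h <<= 1) {
        for (int t = threadIdx.x; t < N / 2; t += blockDim.x) {
            int i = (t / h) * 2 * h + (t % h);
            float a = s[i], b = s[i + h];
            s[i]     = a + b;
            s[i + h] = a - b;
        }
        __syncthreads();
    }

    // Coalesced write-back.
    for (int i = threadIdx.x; i < N; i += blockDim.x) x[i] = s[i];
}

// Example launch for a batch of length-256 transforms:
//   fwht_batched_shared<256><<<batch, 128>>>(d_data);
```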
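A sketch of the warp-shuffle idea for very small transforms (assumptions: one length-32 transform per warp, one element per lane). __shfl_xor_sync exchanges values between lanes whose IDs differ in one bit, which is exactly the butterfly-partner pattern, so no shared memory or __syncthreads is needed:

```cuda
// Length-32 FWHT entirely in registers: one transform per warp, one element
// per lane, partner exchange via warp shuffles.
#include <cuda_runtime.h>

__global__ void fwht32_warp(float* data /* [batch][32] */, int batch) {
    int lane = threadIdx.x & 31;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;  // one warp per transform
    if (warp >= batch) return;   // whole warp exits together, no divergence

    float v = data[warp * 32 + lane];
    for (int h = 1; h < 32; h <<= 1) {
        // Fetch the butterfly partner's value from lane (lane ^ h).
        float partner = __shfl_xor_sync(0xffffffffu, v, h);
        // The lower lane of each pair keeps the sum, the upper lane the difference.
        v = (lane & h) ? (partner - v) : (v + partner);
    }
    data[warp * 32 + lane] = v;
}
```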
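And a sketch of how one Kronecker factor of the transform becomes a single GEMM (again my own illustration, not HadaCore's kernel): applying I ⊗ H16, i.e. multiplying every contiguous group of 16 elements of each vector by the 16×16 Hadamard matrix, via one cuBLAS call. With TF32 enabled (or FP16 inputs) cuBLAS can route such GEMMs to tensor cores; the complementary Kronecker factor needs a reshape/transpose between calls and is omitted here.

```cuda
// One Kronecker factor (I ⊗ H16) of a length-N FWHT as a single cuBLAS GEMM.
// d_h16 must hold the 16x16 Hadamard matrix (entries ±1, symmetric), d_x the
// batch*N input floats with N divisible by 16; results go to d_y.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void apply_h16_factor(cublasHandle_t handle,
                      const float* d_h16, const float* d_x, float* d_y,
                      int batch, int N) {
    const float one = 1.0f, zero = 0.0f;
    int M = batch * (N / 16);   // number of contiguous length-16 groups

    // Row-major view: X is an (M x 16) matrix and we want Y = X * H16
    // (H16 is symmetric). In cuBLAS column-major terms that is Y^T = H16 * X^T,
    // where X^T and Y^T are 16 x M matrices with leading dimension 16.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);  // opt in to TF32 tensor cores (cuBLAS 11+)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                16, M, 16,
                &one,  d_h16, 16,
                       d_x,   16,
                &zero, d_y,   16);
}
```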
Concrete references (read next)
- HadaCore: Tensor Core Accelerated Hadamard Transform Kernel (Dec 2024): shows how to get high throughput by remapping the FWHT onto tensor cores. (arXiv)
- The 2014–2015 line of work on optimized FWHT on GPUs: shared memory, bank-conflict avoidance, and tiling strategies (useful background and techniques). (Estudo Geral, ResearchGate)
- A recent arXiv (Aug 2025) CountSketch paper noting that the FWHT isn't in cuFFT and describing practical implementations for sketches; useful for applied-ML contexts. (arXiv)
- Older surveys/papers on GPU spectral transforms that discuss the memory-bound nature and possible reformulations. (ResearchGate)