Sparsity Is Not Speed

You can remove 50% of a model's weights and make it slower.

This is not a paradox. It is a hardware constraint that most practitioners encounter only after the damage is done.

The mistake

Pruning removes weights. The goal is a smaller, faster model. The assumption — rarely stated, almost always present — is that fewer weights means less computation.

On standard GPU hardware, that assumption is wrong for unstructured pruning.

GPUs are designed for dense matrix multiplication. Their Tensor Cores execute regular, fixed-shape tile operations, and dense kernels assume every element of a tile is present. When you introduce irregular zeros, scattered unpredictably through a weight matrix, a dense kernel cannot skip them. It still loads the full tensor, executes the full matrix multiply, and multiplies by each zero as if it were any other value. The sparsity is real. The speedup is not.
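A toy illustration of that point, written as a sketch rather than a model of real GPU execution: the plain-Python `dense_matvec` below (a hypothetical helper, not a library function) counts multiply-adds for a dense matrix and for a copy with about half its entries zeroed at random. Because the loop never inspects the weight value, both runs do the same amount of work.

```python
import random

def dense_matvec(weights, x):
    """Dense matrix-vector product that also counts multiply-adds."""
    macs = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi  # executed even when w == 0.0
            macs += 1
        out.append(acc)
    return out, macs

random.seed(0)
rows, cols = 8, 16
dense = [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
# Zero out roughly half the entries at random (unstructured) positions.
sparse = [[w if random.random() < 0.5 else 0.0 for w in row] for row in dense]
x = [random.uniform(-1, 1) for _ in range(cols)]

_, macs_dense = dense_matvec(dense, x)
_, macs_sparse = dense_matvec(sparse, x)
zeros = sum(1 for row in sparse for w in row if w == 0.0)

print(f"zeros in pruned matrix: {zeros} of {rows * cols}")
print(f"multiply-adds, dense:  {macs_dense}")
print(f"multiply-adds, pruned: {macs_sparse}")
```

Skipping the zeros would need a branch or an index structure per element, which is exactly the irregularity that dense hardware paths are built to avoid.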

Two kinds of pruning

Unstructured pruning removes individual weights based on magnitude or gradient criteria. It gives the best accuracy at a given sparsity level: SparseGPT demonstrated that LLMs with 175 billion parameters can reach 50–60% sparsity with minimal perplexity loss. But on a standard GPU running dense kernels, a 50% sparse model from unstructured pruning runs at roughly the same speed as the dense original. It is sometimes slower, because sparse storage formats add index lookups and irregular memory access.
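A minimal sketch of the simplest unstructured criterion, global magnitude pruning, together with a rough storage comparison that shows where sparse format overhead comes from. This is not SparseGPT's method, which uses second-order information. The helper names, the 2-byte (FP16) values, and the 4-byte CSR indices are assumptions chosen for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a 2-D list of weights."""
    ranked = sorted(
        (abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)
    )
    k = int(len(ranked) * sparsity)  # number of weights to remove
    pruned = [row[:] for row in weights]
    for _, r, c in ranked[:k]:
        pruned[r][c] = 0.0
    return pruned

def dense_bytes(n_rows, n_cols, value_bytes=2):
    """Bytes to store every entry, zeros included (assumes FP16 values)."""
    return n_rows * n_cols * value_bytes

def csr_bytes(matrix, value_bytes=2, index_bytes=4):
    """Bytes for a CSR layout: one value and one column index per nonzero,
    plus one row pointer per row and a final end pointer (assumes int32 indices)."""
    nnz = sum(1 for row in matrix for w in row if w != 0.0)
    return nnz * (value_bytes + index_bytes) + (len(matrix) + 1) * index_bytes

weights = [
    [0.9, -0.1, 0.05, -0.7],
    [0.2, 0.8, -0.03, 0.4],
    [-0.6, 0.01, 0.3, -0.02],
    [0.15, -0.5, 0.07, 1.1],
]
pruned = magnitude_prune(weights, 0.5)
nnz = sum(1 for row in pruned for w in row if w != 0.0)

print(pruned)
print("nonzeros:", nnz)
print("dense bytes:", dense_bytes(4, 4))
print("CSR bytes:  ", csr_bytes(pruned))
```

Under these assumptions each surviving weight costs 6 bytes in CSR against 2 bytes per entry in the dense layout, so at 50% sparsity the sparse format is larger than the dense one. Narrower indices or bitmap formats change the break-even point, but the index cost never disappears.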

Structured pruning removes entire components: attention heads, MLP neurons, layers, or channels. The result is a smaller dense model with fewer parameters, smaller weight matrices, and standard matrix multiplication. Standard hardware accelerates it with no special kernels. LLM-Pruner and similar frameworks report real latency improvements on commodity hardware by removing whole structural units, with the size of the gain depending on how much structure is removed.
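To make "a smaller dense model" concrete, here is a hedged sketch that removes hidden neurons from a two-layer MLP. The L2-norm importance score and the function name `prune_hidden_neurons` are assumptions for illustration. LLM-Pruner itself uses gradient-based importance and dependency analysis, which this does not reproduce.

```python
import math

def prune_hidden_neurons(w1, w2, keep):
    """Keep the `keep` hidden neurons whose w1 rows have the largest L2 norm.

    w1 has shape hidden x in, w2 has shape out x hidden.
    Dropping a neuron removes its row from w1 and its column from w2.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in w1]
    ranked = sorted(range(len(w1)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    new_w1 = [w1[i][:] for i in kept]
    new_w2 = [[row[i] for i in kept] for row in w2]
    return new_w1, new_w2, kept

w1 = [
    [0.5, -0.4, 0.3],     # neuron 0
    [0.01, 0.02, -0.01],  # neuron 1 (near zero)
    [-0.7, 0.6, 0.2],     # neuron 2
    [0.03, -0.01, 0.02],  # neuron 3 (near zero)
]
w2 = [
    [0.1, 0.9, -0.2, 0.4],
    [-0.3, 0.5, 0.8, -0.6],
]

new_w1, new_w2, kept = prune_hidden_neurons(w1, w2, keep=2)
print("kept neurons:", kept)
print("w1 shape:", len(new_w1), "x", len(new_w1[0]))
print("w2 shape:", len(new_w2), "x", len(new_w2[0]))
```

The outputs are ordinary, smaller dense matrices, so any existing matmul kernel runs them with proportionally less work and no sparse format or index metadata.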

The tradeoff is accuracy. Structured pruning applies coarser cuts. At high sparsity ratios, the accuracy cost climbs faster than with unstructured approaches.

The middle ground: 2:4 sparsity

NVIDIA's Ampere architecture (A100 and later) introduced hardware support for a specific semi-structured pattern: 2:4 sparsity, where at least two of every four contiguous values are zero. Sparse Tensor Cores handle this pattern natively, delivering up to 2x theoretical math throughput over dense equivalents. Weight storage, and the bandwidth needed to read it, drops by roughly half, since only the two kept values and a small position index for each are stored. The sparsity is structured enough for hardware to exploit, yet fine-grained enough to preserve accuracy.
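The pattern itself is easy to sketch. The code below keeps the two largest-magnitude values in each group of four and records their positions, then estimates the compressed size. The magnitude criterion, the function names, and the FP16 value width are assumptions for illustration. Production tooling chooses which weights to keep more carefully and uses a hardware-specific memory layout that this does not model.

```python
def prune_2_to_4(row):
    """Apply a 2:4 pattern to one row.

    Returns (pruned_row, kept_values, kept_positions), where kept_positions
    holds the in-group index (0..3) of each kept value.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned, values, positions = [], [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(by_magnitude[:2])
        pruned.extend(group[i] if i in keep else 0.0 for i in range(4))
        values.extend(group[i] for i in keep)
        positions.extend(keep)
    return pruned, values, positions

def compressed_ratio(value_bits=16, index_bits=2):
    """Compressed size over dense size for one group of four values."""
    dense = 4 * value_bits
    packed = 2 * (value_bits + index_bits)  # two values, each with a 2-bit position
    return packed / dense

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.8, -0.03, 0.4]
pruned, values, positions = prune_2_to_4(row)

print("pruned:   ", pruned)
print("values:   ", values)
print("positions:", positions)
print("FP16 compressed/dense ratio:", compressed_ratio())
```

With 16-bit values and a 2-bit position per kept value, the packed form comes to 36 bits per group against 64 bits dense, which is where "roughly half" comes from.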

This is not a general solution. It requires Ampere or newer hardware and specific pruning tooling, and the pattern fixes the sparsity rate at 50%. But it narrows the gap: near-unstructured accuracy at near-structured speed.

The practical question

Before pruning, ask one question: does my inference hardware have a fast path for this sparsity pattern?

If you are serving on A100/H100-class GPUs with a runtime that uses their 2:4 sparse kernels: semi-structured pruning is a strong choice.

If you are serving on commodity or edge hardware: structured pruning is the only reliable path to real speedup.

If you need maximum compression and have sparse-aware inference software (SpInfer, llama.cpp sparse extensions): unstructured pruning at high ratios makes sense — but only if your runtime actually exploits the zeros.
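The three cases above can be written down as a small decision helper. This is only a restatement of the rules of thumb in this section, with hypothetical parameter names. It is not a substitute for benchmarking on the target hardware.

```python
def choose_pruning_strategy(
    has_2_4_sparse_tensor_cores,
    has_sparse_aware_runtime,
    needs_max_compression=False,
):
    """Map the serving environment to the pruning style this article suggests."""
    # Unstructured pruning only pays off when the runtime exploits the zeros.
    if needs_max_compression and has_sparse_aware_runtime:
        return "unstructured"
    # Hardware with a 2:4 fast path: semi-structured pruning.
    if has_2_4_sparse_tensor_cores:
        return "semi-structured (2:4)"
    # Commodity or edge hardware: only a smaller dense model is reliably faster.
    return "structured"

print(choose_pruning_strategy(True, False))         # data-center GPU with 2:4 support
print(choose_pruning_strategy(False, False))        # commodity or edge device
print(choose_pruning_strategy(False, True, True))   # sparse-aware runtime available
print(choose_pruning_strategy(False, False, True))  # wants compression, no fast path
```

The last call is the case this article warns about: wanting high compression without a runtime that exploits it falls back to structured pruning.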

Sparsity without a hardware fast path only removes weights. The model may be lighter on disk, if it is stored in a compressed format. It is not faster at inference.

Know your hardware before you pick your pruning strategy. The math does not run the metal.

References

1. Frantar, E., Alistarh, D. (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. ICML 2023. https://arxiv.org/abs/2301.00774

2. Ma, X., et al. (2023). LLM-Pruner: On the Structural Pruning of Large Language Models. NeurIPS 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/44956951349095f74492a5471128a7e0-Paper-Conference.pdf

3. Pool, J. (2021). Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT. NVIDIA Technical Blog. https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/

4. Mishra, A., et al. (2021). Accelerating Sparse Deep Neural Networks. arXiv:2104.08378. https://arxiv.org/abs/2104.08378

5. Sun, M., et al. (2024). A Simple and Effective Pruning Approach for Large Language Models (Wanda). ICLR 2024. https://arxiv.org/abs/2306.11695

6. Cheng, H., et al. (2024). A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. arXiv:2308.06767. https://arxiv.org/abs/2308.06767