The Activation Is a Gate
Most modern LLMs use a gated feed-forward layer. LLaMA, Mistral, Qwen, and PaLM use SwiGLU; Gemma uses the closely related GeGLU. Most practitioners know this as a footnote in the architecture description. It is more than that.
What Changed
A standard Transformer FFN applies one linear projection, one activation, and a second linear projection:
```
FFN(x) = activation(xW₁) W₂
```
Two weight matrices. One nonlinearity.
SwiGLU restructures this into three weight matrices and a gating mechanism:
```
FFN_SwiGLU(x) = (Swish(xW) ⊙ xV) W₂
```
The gate path (`xW`) passes through Swish (also called SiLU). The value path (`xV`) passes through unchanged. Their element-wise product is the hidden activation, which `W₂` then projects back down to the model dimension.
This is not a drop-in activation swap. It is a different layer shape.
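The two layer shapes can be compared side by side in a minimal NumPy sketch. The function names, the toy sizes, and the choice of ReLU for the standard FFN are assumptions made for illustration; biases are omitted, as in the formulas above.

```python
import numpy as np

def swish(z: np.ndarray) -> np.ndarray:
    # Swish (SiLU): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def ffn_standard(x, W1, W2):
    # One up-projection, one nonlinearity (ReLU here), one down-projection.
    return np.maximum(x @ W1, 0.0) @ W2

def ffn_swiglu(x, W, V, W2):
    gate = swish(x @ W)           # gate path: projection followed by Swish
    value = x @ V                 # value path: projection only
    return (gate * value) @ W2    # element-wise product, then down-projection

rng = np.random.default_rng(0)
d, d_ff = 8, 32                   # toy sizes, chosen only for illustration
x = rng.standard_normal((4, d))   # 4 tokens of dimension d

W1 = rng.standard_normal((d, d_ff))
W = rng.standard_normal((d, d_ff))
V = rng.standard_normal((d, d_ff))
W2 = rng.standard_normal((d_ff, d))

y_std = ffn_standard(x, W1, W2)
y_glu = ffn_swiglu(x, W, V, W2)
print(y_std.shape, y_glu.shape)   # (4, 8) (4, 8)
```

Both layers map `d` to `d`, so the rest of the Transformer block is unaffected. The difference is entirely inside: the gated version carries one extra `(d, d_ff)` matrix.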
The Parameter Arithmetic
Three weight matrices instead of two means 50% more FFN parameters at the same hidden size. Practitioners compensate by shrinking the hidden dimension: if the standard FFN uses hidden size `4d`, the SwiGLU FFN uses `8d/3`, keeping the total parameter count roughly equal.
The equation is exact:
```
3 × d × d_ff_gated = 2 × d × d_ff_std
→ d_ff_gated = (2/3) × d_ff_std
```
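The arithmetic can be checked in a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, biases are ignored, and the optional `multiple_of` rounding (256 in the example) is an illustrative hardware-friendly choice rather than something the equation requires. Because hidden sizes are integers, the match is exact only when `d_ff_std` is divisible by 3.

```python
def gated_hidden_size(d_model: int, expansion: int = 4, multiple_of: int = 1) -> int:
    """Hidden size of a three-matrix FFN that matches a two-matrix FFN's budget."""
    d_ff_std = expansion * d_model
    d_ff_gated = (2 * d_ff_std) // 3   # (2/3) * d_ff_std, floored to an integer
    # Optionally round up to a multiple (an assumption, not part of the equation).
    return multiple_of * ((d_ff_gated + multiple_of - 1) // multiple_of)

def ffn_params(d_model: int, d_ff: int, n_matrices: int) -> int:
    # Every FFN matrix is d_model x d_ff (or its transpose); biases ignored.
    return n_matrices * d_model * d_ff

d = 4096
std = ffn_params(d, 4 * d, 2)
exact = ffn_params(d, gated_hidden_size(d), 3)
rounded = ffn_params(d, gated_hidden_size(d, multiple_of=256), 3)

print(gated_hidden_size(d), gated_hidden_size(d, multiple_of=256))  # 10922 11008
print(std, exact, rounded)  # 134217728 134209536 135266304
```

The three totals differ by less than 1%, which is the sense in which the parameter count is kept equivalent.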
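LLaMA 7B uses an FFN hidden size of 11,008 against a model dimension of 4,096: that ratio (≈2.69×) is the 8/3 adjustment in action, rounded up to a multiple of 256. Llama 3 8B keeps the model dimension at 4,096 but widens the FFN to 14,336 (3.5×), so later models do not always hold to strict parameter parity.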
Why It Works
Gating lets the network learn not just what to activate, but how much of each activation to pass through. The gate acts as a learned filter on the value stream. Standard ReLU or GELU activations apply one fixed function to each coordinate of a single projection; SwiGLU multiplies the value projection by a second, separately learned projection of the same input, giving an input-dependent mask.
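A small numeric sketch makes the filtering behavior concrete. The value `2.0` and the gate pre-activations are made-up numbers standing in for one coordinate of `xV` and `xW`; nothing here comes from a trained model.

```python
import math

def swish(z: float) -> float:
    # Swish (SiLU): z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

value = 2.0  # one coordinate of the value path xV (made-up number)

rows = []
for gate_preact in (-6.0, -1.0, 0.0, 1.0, 6.0):  # same coordinate of xW
    gate = swish(gate_preact)
    rows.append((gate_preact, gate, gate * value))
    print(f"xW = {gate_preact:+.1f}  gate = {gate:+.3f}  gated value = {gate * value:+.3f}")
```

A strongly negative gate pre-activation nearly zeroes the value, a strongly positive one passes it through scaled by roughly the pre-activation itself. The Swish gate is not confined to the range 0 to 1, so "mask" is best read loosely: it is a learned multiplicative scale that can also amplify or slightly negate.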
Shazeer's 2020 experiments, run on a T5-style model, found that SwiGLU (and its cousin GEGLU) produced lower pre-training perplexity than the ReLU and GELU baselines. PaLM adopted it in 2022. LLaMA adopted it in 2023. It is now the default.
The Inference Consequence
Every quantization tool, serving backend, and memory budget calculation that touches the FFN layers must account for three projection matrices, not two. In a Llama 3 70B model, the FFN layers hold roughly 80% of all parameters, and those layers have a different shape than what earlier tooling assumed.
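A rough memory budget shows what the third matrix costs. The dimensions below are assumptions taken from the publicly released Llama 3 70B configuration, not from this article, and the helper name is hypothetical. The estimate ignores biases and any quantization overhead such as scales or zero points.

```python
def ffn_param_count(d_model: int, d_ff: int, n_layers: int, n_matrices: int = 3) -> int:
    # Each FFN matrix is d_model x d_ff (or its transpose); biases ignored.
    return n_layers * n_matrices * d_model * d_ff

# Assumed dimensions (Llama 3 70B public config): model dim, FFN hidden size, layers.
d_model, d_ff, n_layers = 8192, 28672, 80

gated = ffn_param_count(d_model, d_ff, n_layers)                   # three matrices
two_matrix = ffn_param_count(d_model, d_ff, n_layers, n_matrices=2)

print(gated)       # 56371445760
print(two_matrix)  # 37580963840

# Weight memory for the FFN layers alone at a few storage widths.
for name, bytes_per_param in (("fp16/bf16", 2.0), ("int8", 1.0), ("int4", 0.5)):
    print(f"{name}: {gated * bytes_per_param / 1e9:.1f} GB")
```

A tool that assumes two matrices per FFN would undercount these layers by a third.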
This matters for:
- Quantization schemes: calibration and rounding errors compound across three projections, not two
- Memory layout: the gate and value matrices are typically fused into a single tensor (`W_gate_up`) in optimized runtimes for better memory bandwidth
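The fusion in the last bullet can be sketched in NumPy. The concatenation order and axis used here are assumptions for illustration; real runtimes choose their own layouts, and some interleave rows instead.

```python
import numpy as np

def swish(z: np.ndarray) -> np.ndarray:
    # Swish (SiLU): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, d_ff = 8, 24                       # toy sizes, chosen only for illustration
x = rng.standard_normal((4, d))
W = rng.standard_normal((d, d_ff))    # gate projection
V = rng.standard_normal((d, d_ff))    # value projection

# Unfused: two matmuls, so x is read from memory twice.
h_ref = swish(x @ W) * (x @ V)

# Fused: one (d, 2 * d_ff) tensor, one matmul, then split the result in half.
W_gate_up = np.concatenate([W, V], axis=1)
gate, up = np.split(x @ W_gate_up, 2, axis=1)
h_fused = swish(gate) * up

print(W_gate_up.shape)                # (8, 48)
print(np.allclose(h_ref, h_fused))    # True
```

The math is unchanged. What changes is the tensor a loader, quantizer, or converter actually finds on disk, which is why tooling has to know which half is the gate.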
The activation is not a detail. It is a load-bearing structural choice that was made at training time — and every system that runs the model lives with the consequences.
References
1. Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202. https://arxiv.org/abs/2002.05202
2. Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288. https://arxiv.org/abs/2307.09288
3. Chowdhery, A. et al. (2022). PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311. https://arxiv.org/abs/2204.02311
4. CogniSoc ZigLlama Documentation. Feed-Forward Networks. https://docs.cognisoc.com/zigllm/transformers/feed-forward-networks/
Cite as
devinfo.dev. (2026). "The Activation Is a Gate." devinfo.dev:2026.0056. https://devinfo.dev/d/2026.0056
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.