Parallelism Is a Topology Decision

inspiration | devinfo.dev | June 13, 2026 | devinfo.dev:2026.0031

Tensor parallelism and pipeline parallelism are not interchangeable scaling knobs. They encode different assumptions about your hardware, your model shape, and what you are optimizing for. Choosing wrong does not just waste GPUs — it locks in a latency-throughput tradeoff you did not knowingly make.

When a model does not fit on one GPU, you parallelize. Two strategies dominate: tensor parallelism (TP) and pipeline parallelism (PP). Most documentation treats them as equivalent options. They are not.

What Each Strategy Actually Does

Tensor parallelism shards individual weight matrices across GPUs. Each GPU holds a slice of every layer. For a linear projection with weight matrix W of shape [d_model, d_ff], column-parallel TP gives each of N GPUs a shard of shape [d_model, d_ff/N]. Every GPU computes its shard's output in parallel. In Megatron-LM's design the projection that follows is split row-wise, so each GPU ends up holding a partial sum of the block's output, and an AllReduce combines those partial sums before the next block begins.
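To make the sharding concrete, here is a minimal sketch of a tensor-parallel FFN block in plain Python. It assumes the Megatron-LM style pairing of a column-split first projection with a row-split second projection. Lists stand in for GPU tensors, a loop stands in for the GPUs, and every function name is illustrative rather than taken from any library.

```python
# Toy model of a tensor-parallel FFN block. Plain Python lists stand in for
# GPU tensors, and all function names are illustrative.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0, v) for v in row] for row in m]

def ffn(x, w1, w2):
    """Unsharded reference: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)

def all_reduce_sum(partials):
    """Element-wise sum of the per-GPU partial outputs (what AllReduce computes)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def tp_ffn(x, w1, w2, n_gpus):
    """Same FFN, with w1 split by columns and w2 split by rows across n_gpus."""
    d_ff = len(w1[0])
    assert d_ff % n_gpus == 0, "d_ff must divide evenly across GPUs"
    width = d_ff // n_gpus
    partials = []
    for g in range(n_gpus):  # each iteration models one GPU
        lo, hi = g * width, (g + 1) * width
        w1_shard = [row[lo:hi] for row in w1]  # column shard: [d_model, d_ff/N]
        w2_shard = w2[lo:hi]                   # row shard:    [d_ff/N, d_model]
        hidden = relu(matmul(x, w1_shard))     # local work, no communication
        partials.append(matmul(hidden, w2_shard))  # partial sum of the output
    return all_reduce_sum(partials)            # the one synchronization point

x = [[1, 2]]
w1 = [[1, 0, -2, 1], [0, 1, 1, 3]]
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]
assert tp_ffn(x, w1, w2, n_gpus=2) == ffn(x, w1, w2)
```

The sharded and unsharded versions agree because the element-wise nonlinearity acts independently on each column slice, so the only cross-GPU step is the final sum.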

Pipeline parallelism partitions the model's layers into sequential stages, each stage assigned to one or more GPUs. GPU 0 handles layers 0–7, GPU 1 handles layers 8–15, and so on. Data flows through the pipeline in microbatches. Each stage processes its assigned layers and passes activations forward.
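The layer-to-stage assignment can be sketched in a few lines. The helper below is hypothetical: it splits layers into contiguous, near-equal ranges and hands any remainder to the earliest stages. Real frameworks may balance stages differently, for example to account for embedding layers.

```python
def partition_layers(num_layers, num_stages):
    """Return one contiguous range of layer indices per pipeline stage."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need between 1 and num_layers stages")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages

# 32 layers over 4 stages reproduces the split described above.
for gpu, layers in enumerate(partition_layers(32, 4)):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```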

Same number of GPUs. Completely different computation graphs.

The Communication Difference

This is where the two strategies diverge most sharply.

TP requires AllReduce synchronization inside every layer: all N GPUs must combine their partial results before computation can continue. For a 32-layer transformer with TP=4, that is 64 AllReduce operations per forward pass (two per transformer layer, one after the attention block and one after the FFN block, as established in Megatron-LM's design). AllReduce over four GPUs on NVLink costs roughly 10–50 µs per call. Across 64 calls, the communication overhead is real but small, as long as the interconnect is fast.
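The arithmetic behind those figures is easy to reproduce. The sketch below uses only the numbers quoted above (two AllReduce calls per layer, 10–50 µs per call); the function name and defaults are illustrative, and real per-call latency depends on message size and interconnect.

```python
def tp_allreduce_budget(num_layers, calls_per_layer=2, latency_us=(10.0, 50.0)):
    """Return (calls, low_ms, high_ms) of AllReduce cost for one forward pass."""
    calls = num_layers * calls_per_layer
    low_us, high_us = latency_us
    return calls, calls * low_us / 1000.0, calls * high_us / 1000.0

calls, low_ms, high_ms = tp_allreduce_budget(32)
print(f"{calls} AllReduce calls, about {low_ms:.2f} to {high_ms:.2f} ms per forward pass")
```

For the 32-layer case this comes to 64 calls and roughly 0.64 to 3.2 ms per forward pass. Autoregressive decoding runs one forward pass per generated token, so that cost is paid on every token.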

PP requires only point-to-point communication between adjacent stages: stage 0 sends activations to stage 1, stage 1 sends to stage 2. Far fewer synchronization events. But pipeline bubbles — the idle time when a stage waits for the previous one — become the cost you pay instead.
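The size of the bubble can be estimated under a simplifying assumption that does not come from the article: a fill-and-drain schedule in which every stage takes the same time per microbatch and communication is free. With p stages and m microbatches the schedule spans m + p - 1 time slots, and each stage is busy for only m of them.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Share of each stage's time spent idle in a fill-and-drain pipeline.

    Assumes equal per-stage compute time and zero communication cost.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one microbatch")
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots

for m in (1, 4, 32):
    print(f"4 stages, {m:>2} microbatches: {bubble_fraction(4, m):.0%} idle")
```

With a single microbatch, a four-stage pipeline is idle 75% of the time. The fraction shrinks only as more microbatches are kept in flight.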

The rule that follows: TP wants fast intra-node interconnect (NVLink, NVSwitch). PP tolerates slower inter-node links (InfiniBand, Ethernet). Use TP within a node. Use PP across nodes.
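As a starting point, the rule can be written as a small helper that maps cluster shape to a layout. This is a hypothetical sketch, not any framework's API. The head-count check reflects a common constraint in TP implementations (attention heads are divided evenly across TP ranks) and is an assumption added here, not something the rule itself states.

```python
def choose_parallelism(gpus_per_node, num_nodes, num_attention_heads=None):
    """Apply 'TP within a node, PP across nodes' to a cluster shape."""
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("cluster must have at least one node and one GPU")
    tp, pp = gpus_per_node, num_nodes
    # Assumption: attention heads must split evenly across TP ranks.
    if num_attention_heads is not None and num_attention_heads % tp != 0:
        raise ValueError(f"{num_attention_heads} heads do not divide across TP={tp}")
    return {"tensor_parallel": tp, "pipeline_parallel": pp}

print(choose_parallelism(gpus_per_node=8, num_nodes=1))
print(choose_parallelism(gpus_per_node=8, num_nodes=2, num_attention_heads=64))
```

Treat the result as a default to benchmark against, since the best layout also depends on batch size and latency targets.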

The Latency-Throughput Tradeoff

Empirical results on Llama-3.1-70B and 405B confirm what the architecture implies: TP reduces latency, PP increases throughput.

TP reduces latency because all GPUs work on the same request simultaneously. The time-to-first-token drops because each layer completes faster — you have more memory bandwidth serving one request.

PP increases throughput because the pipeline can hold multiple microbatches in flight. While stage 1 processes batch N, stage 0 is already working on batch N+1. Throughput scales with pipeline depth. Latency does not improve — and can worsen because of bubble overhead at small batch sizes.
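A toy timing model shows why throughput rises while latency does not. It rests on assumptions that are not from the article: the model's total compute time splits evenly across stages, communication is free, and microbatches enter back to back. All names are illustrative.

```python
def pipeline_stats(model_time_ms, num_stages, num_microbatches):
    """Return (latency_ms, makespan_ms, microbatches_per_ms) for an idealized pipeline."""
    stage_time = model_time_ms / num_stages
    latency = stage_time * num_stages  # one microbatch still visits every stage
    makespan = stage_time * (num_microbatches + num_stages - 1)  # fill, then drain
    throughput = num_microbatches / makespan
    return latency, makespan, throughput

for stages, microbatches in ((1, 32), (4, 32), (4, 1)):
    latency, makespan, tput = pipeline_stats(40.0, stages, microbatches)
    print(f"stages={stages} microbatches={microbatches}: "
          f"latency={latency:.0f} ms, throughput={tput:.4f} per ms")
```

In this model, four stages with 32 microbatches complete about 3.7 times as many microbatches per millisecond as a single stage, while each microbatch still takes the full 40 ms. With one microbatch, the four-stage pipeline gains nothing over a single stage.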

The NVIDIA TensorRT-LLM documentation states this plainly: "Pipeline parallelism is a low-overhead mechanism for efficiently increasing overall throughput, while tensor parallelism is a higher-overhead mechanism for reducing latency."

What This Means for Self-Hosted Inference

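If you are running on a single multi-GPU node (two to eight GPUs, NVLink or PCIe): use TP across the GPUs in the node. PP adds bubble overhead here without saving any inter-node bandwidth.

If you are running across multiple nodes: use TP within each node and PP across nodes.

If you are latency-sensitive (interactive chat, streaming): favor TP, since all GPUs work on the same request simultaneously.

If you are throughput-sensitive (batch processing, offline jobs): favor PP, and keep enough microbatches in flight to hide the bubbles.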

The Mistake to Avoid

Setting --pipeline-parallel-size on a single node to "use more GPUs" does not help. It introduces bubble overhead without the inter-node bandwidth savings that motivate PP in the first place. vLLM will not warn you. The model will run — just slower.
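Because nothing warns about this, a deployment script can carry its own check. The sketch below is a hypothetical pre-flight helper, not part of vLLM. It only builds the argument list for `vllm serve` using the `--tensor-parallel-size` and `--pipeline-parallel-size` flags, and the model name is a placeholder.

```python
def check_parallel_config(tp_size, pp_size, gpus_per_node, num_nodes):
    """Return human-readable warnings for a TP/PP layout (hypothetical helper)."""
    warnings = []
    if tp_size * pp_size > gpus_per_node * num_nodes:
        warnings.append("layout needs more GPUs than the cluster has")
    if num_nodes == 1 and pp_size > 1:
        warnings.append("pipeline parallelism on a single node adds bubbles "
                        "without saving inter-node bandwidth")
    if tp_size > gpus_per_node:
        warnings.append("tensor parallelism spans nodes, so AllReduce traffic "
                        "crosses the slower inter-node link")
    return warnings

def vllm_serve_args(model, tp_size, pp_size):
    """Build the command line only; launching it is left to the deployment."""
    return ["vllm", "serve", model,
            "--tensor-parallel-size", str(tp_size),
            "--pipeline-parallel-size", str(pp_size)]

# "my-org/my-model" is a placeholder model name.
for problem in check_parallel_config(tp_size=2, pp_size=4, gpus_per_node=8, num_nodes=1):
    print("WARNING:", problem)
print(" ".join(vllm_serve_args("my-org/my-model", tp_size=8, pp_size=1)))
```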

The wrong parallelism strategy wastes 30–50% of GPU memory or tanks throughput. The decision belongs in your architecture documentation, made before deployment, not discovered after.

References