Parallelism Is a Topology Decision

When a model does not fit on one GPU, you parallelize. Two strategies dominate: tensor parallelism (TP) and pipeline parallelism (PP). Most documentation treats them as equivalent options. They are not.

What Each Strategy Actually Does

Tensor parallelism shards individual weight matrices across GPUs. Each GPU holds a slice of every layer. For a linear projection with weight matrix W of shape [d_model, d_ff], column-parallel TP gives each of N GPUs a shard of shape [d_model, d_ff/N]. Every GPU computes its shard's output in parallel. In the Megatron-LM design, a column-parallel projection is paired with a row-parallel one, and an AllReduce sums the partial results before the next block begins.
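To make the column-parallel split concrete, here is a minimal sketch in plain Python. It simulates the N GPUs as list entries and joins the per-shard outputs by local concatenation, so it shows only the sharding arithmetic, not a real collective. The function names and the tiny matrices are illustrative assumptions.

```python
def matmul(x, w):
    """Multiply a [rows, k] matrix by a [k, cols] matrix (lists of lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_shards(w, n):
    """Split W of shape [d_model, d_ff] into n shards of shape [d_model, d_ff/n]."""
    d_ff = len(w[0])
    assert d_ff % n == 0, "d_ff must divide evenly across shards"
    width = d_ff // n
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(n)]


def column_parallel_forward(x, w, n):
    """Each simulated GPU multiplies x by its own shard; outputs are joined column-wise."""
    partials = [matmul(x, shard) for shard in column_shards(w, n)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]              # [batch, d_model]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # [d_model, d_ff]

full = matmul(x, w)                            # single-GPU reference
sharded = column_parallel_forward(x, w, n=2)   # two simulated GPUs
print(sharded == full)
```

The sharded result matches the unsharded product, which is the property that lets each GPU hold only 1/N of the weight matrix.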

Pipeline parallelism partitions the model's layers into sequential stages, each stage assigned to one or more GPUs. GPU 0 handles layers 0–7, GPU 1 handles layers 8–15, and so on. Data flows through the pipeline in microbatches. Each stage processes its assigned layers and passes activations forward.
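The layer-to-stage assignment can be sketched in a few lines. This assumes contiguous, near-even partitioning with one GPU per stage. partition_layers is a hypothetical helper, not an API from any framework, and real frameworks may balance stages differently.

```python
def partition_layers(num_layers, num_stages):
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = partition_layers(num_layers=32, num_stages=4)
for gpu, layers in enumerate(stages):
    print(f"GPU {gpu}: layers {layers.start}-{layers.stop - 1}")
```

With 32 layers and 4 stages this reproduces the split described above: GPU 0 gets layers 0-7, GPU 1 gets layers 8-15, and so on.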

Same number of GPUs. Completely different computation graphs.

The Communication Difference

This is where the two strategies diverge most sharply.

TP requires AllReduce operations inside every transformer layer: all N GPUs must synchronize before computation can continue. For a 32-layer transformer with TP=4, that is 64 AllReduce operations per forward pass (two per layer, one after attention and one after the FFN, as established in Megatron-LM's design). AllReduce over four GPUs on NVLink costs roughly 10–50 µs per call. Across 64 calls, that adds up to roughly 0.6–3.2 ms per forward pass: real, but small as long as the interconnect is fast.
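The arithmetic behind those numbers is simple enough to write down. The sketch below takes the per-call cost range quoted above as an input assumption, not a measurement. Actual AllReduce latency depends on message size, hardware, and library.

```python
def allreduce_overhead_us(num_layers, allreduces_per_layer, cost_us_low, cost_us_high):
    """Return (call count, low total, high total) in microseconds for one forward pass."""
    calls = num_layers * allreduces_per_layer
    return calls, calls * cost_us_low, calls * cost_us_high


# 32 layers, two AllReduce calls per layer, 10-50 µs per call (assumed range).
calls, low_us, high_us = allreduce_overhead_us(32, 2, 10, 50)
print(f"{calls} calls, {low_us / 1000:.2f}-{high_us / 1000:.2f} ms per forward pass")
```

Rerunning the same function with a per-call cost typical of a slower link shows how quickly the total grows once AllReduce leaves NVLink.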

PP requires only point-to-point communication between adjacent stages: stage 0 sends activations to stage 1, stage 1 sends to stage 2. Far fewer synchronization events. But pipeline bubbles — the idle time when a stage waits for the previous one — become the cost you pay instead.
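The bubble cost can be estimated with a standard back-of-the-envelope model. The sketch assumes a simple forward-only schedule in which every stage takes the same time per microbatch. With p stages and m microbatches, the run takes m + p - 1 time slots and each stage is idle for p - 1 of them. Real schedulers and uneven stages will deviate from this.

```python
from fractions import Fraction


def bubble_fraction(num_stages, num_microbatches):
    """Idle share of each stage's time under an idealized, equal-cost schedule."""
    p, m = num_stages, num_microbatches
    return Fraction(p - 1, m + p - 1)


for m in (1, 4, 32):
    frac = bubble_fraction(num_stages=4, num_microbatches=m)
    print(f"4 stages, {m:>2} microbatches: {float(frac):.0%} idle")
```

Under these assumptions a single microbatch leaves each of the four stages idle three quarters of the time, while 32 microbatches shrink the idle share to under a tenth.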

The rule that follows: TP wants fast intra-node interconnect (NVLink, NVSwitch). PP tolerates slower inter-node links (InfiniBand, Ethernet). Use TP within a node. Use PP across nodes.

The Latency-Throughput Tradeoff

Empirical results on Llama-3.1-70B and 405B confirm what the architecture implies: TP reduces latency, PP increases throughput.

TP reduces latency because all GPUs work on the same request simultaneously. Both time-to-first-token and per-token latency drop because each layer completes faster: the combined compute and memory bandwidth of all N GPUs serve one request.

PP increases throughput because the pipeline can hold multiple microbatches in flight. While stage 1 processes batch N, stage 0 is already working on batch N+1. Throughput scales with pipeline depth. Latency does not improve — and can worsen because of bubble overhead at small batch sizes.
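The overlap described above is easier to see as a timeline. This is a toy schedule, assuming every stage takes exactly one tick per microbatch and passes its output to the next stage on the following tick. pipeline_schedule is a hypothetical helper written for illustration.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return schedule[tick][stage] = microbatch index in progress, or None if idle."""
    ticks = num_stages + num_microbatches - 1
    schedule = []
    for t in range(ticks):
        row = []
        for s in range(num_stages):
            mb = t - s  # stage s starts microbatch mb at tick mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


for t, row in enumerate(pipeline_schedule(num_stages=2, num_microbatches=3)):
    cells = ["idle" if mb is None else f"mb{mb}" for mb in row]
    print(f"tick {t}: " + "  ".join(f"stage {s}={c}" for s, c in enumerate(cells)))
```

In the middle ticks both stages are busy on different microbatches, which is where the throughput gain comes from. The idle cells at the first and last tick are the bubbles, and any single microbatch still has to cross every stage in sequence, so its latency is not reduced.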

The NVIDIA TensorRT-LLM documentation states this plainly: "Pipeline parallelism is a low-overhead mechanism for efficiently increasing overall throughput, while tensor parallelism is a higher-overhead mechanism for reducing latency."

What This Means for Self-Hosted Inference

If you are running on a single multi-GPU node (two to eight GPUs, NVLink or PCIe):

Use TP. Set --tensor-parallel-size to your GPU count.
PP adds bubble overhead with no benefit on a single node.

If you are running across multiple nodes:

Use TP within each node (intra-node NVLink is fast).
Use PP across nodes (inter-node bandwidth is the bottleneck for AllReduce, not for point-to-point).
vLLM's recommendation: tensor_parallel_size = GPUs per node, pipeline_parallel_size = number of nodes.
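As a sketch of how the single-node and multi-node rules translate into launch flags, the helper below builds the parallelism arguments for a vllm serve command. The helper name and the model placeholder are hypothetical. Multi-node launches also need cluster setup that is not shown here.

```python
def vllm_parallel_args(gpus_per_node, num_nodes):
    """Build parallelism flags: TP spans the GPUs in a node, PP spans the nodes."""
    args = ["--tensor-parallel-size", str(gpus_per_node)]
    if num_nodes > 1:
        # Only add pipeline stages when the deployment actually spans nodes.
        args += ["--pipeline-parallel-size", str(num_nodes)]
    return args


MODEL = "your-org/your-model"  # hypothetical placeholder, substitute your own model

single_node = ["vllm", "serve", MODEL] + vllm_parallel_args(gpus_per_node=8, num_nodes=1)
two_nodes = ["vllm", "serve", MODEL] + vllm_parallel_args(gpus_per_node=8, num_nodes=2)
print(" ".join(single_node))
print(" ".join(two_nodes))
```

Keeping the pipeline flag out of the single-node case makes the default safe: pipeline stages appear only when there is an inter-node link to justify them.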

If you are latency-sensitive (interactive chat, streaming):

Prioritize TP. Minimize pipeline depth.

If you are throughput-sensitive (batch processing, offline jobs):

PP is acceptable. Pipeline bubbles matter less when you have large batches.

The Mistake to Avoid

Setting --pipeline-parallel-size on a single node to "use more GPUs" does not help. It introduces bubble overhead without the inter-node bandwidth savings that motivate PP in the first place. vLLM will not warn you. The model will run — just slower.

The wrong parallelism strategy can waste 30–50% of GPU memory or tank throughput. Make the decision before deployment and record it in your architecture documentation, rather than discovering it afterward.

References

Narayanan, D., Shoeybi, M., Casper, J., et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC 2021. https://arxiv.org/abs/2104.04473
NVIDIA. (2024). Boosting Llama 3.1 405B Throughput by Another 1.5x on NVIDIA H200 Tensor Core GPUs and NVLink Switch. NVIDIA Technical Blog. https://developer.nvidia.com/blog/boosting-llama-3-1-405b-throughput-by-another-1-5x-on-nvidia-h200-tensor-core-gpus-and-nvlink-switch/
vLLM Documentation. Parallelism and Scaling. https://docs.vllm.ai/en/stable/serving/parallelism_scaling/
NVIDIA. Deciding Model Sharding Strategy. TensorRT-LLM Documentation. https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/deciding-model-sharding-strategy.html
NVIDIA. Parallelism Strategies Guide. Megatron Core Developer Guide. https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html