Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper
Most comparisons of inference engines ask: which is fastest?
That question has no useful answer, because performance depends entirely on the workload. At one concurrent user, llama.cpp, Ollama, and vLLM produce nearly identical throughput — within 3% of each other on the same hardware. At 128 concurrent users, vLLM is 8–9x faster than Ollama.
The right question is not which is fastest but fastest at what.
This paper answers that question.
---
llama.cpp is a C/C++ inference library. It is not a server. It is the computational core that all three engines are built around — or in the case of Ollama, literally use as their backend.
Its defining characteristics:
llama.cpp is the right choice when you are embedding inference in a tool, running on CPU, or need maximum portability.
Ollama is llama.cpp with a REST API wrapper, a model registry, and a developer-friendly CLI. Inference performance is identical to llama.cpp within measurement noise (0–3% variance). The difference is operational, not computational.
What Ollama adds:
ollama run, ollama pull — model management without manual GGUF downloads./api/chat endpoint — drop-in for tooling that expects an OpenAI API shape.What it does not add:
Ollama is the right choice for developer workstations, personal inference servers, and embedded single-user tools.
vLLM is a Python inference engine built for multi-user throughput. Its core innovation is PagedAttention — a KV cache management system borrowed from virtual memory systems in operating systems.
Traditional inference reserves a contiguous block of memory for each request's KV cache upfront, sized for the maximum sequence length. Most of that memory is unused for most of the request's lifetime. PagedAttention allocates KV cache in small, non-contiguous pages, filled on demand. This dramatically reduces memory waste and enables true continuous batching: new requests are interleaved with in-flight requests at the token level, not the request level.
The result: at 128 concurrent users, vLLM achieves 920 tok/s vs Ollama's ~160 tok/s. Time-to-first-token: 145ms vs 3,200ms. These are not marginal differences.
What vLLM costs:
vLLM is the right choice for shared inference APIs, multi-user self-hosted endpoints, and anything serving more than a few concurrent requests.
---
The numbers below represent representative benchmarks on single-GPU hardware (RTX 4090 class). Your hardware will produce different absolute numbers; the ratios hold.
| Engine | Throughput | Notes |
|--------|-----------|-------|
| llama.cpp | ~42 tok/s | Baseline |
| Ollama | ~42 tok/s | Identical (same backend) |
| vLLM | ~45 tok/s | Within noise |
At one user, engine choice is irrelevant to performance. Pick on operational criteria.
| Engine | Throughput | Success rate |
|--------|-----------|-------------|
| llama.cpp | ~180 tok/s | 100% (queued) |
| Ollama | ~160 tok/s | 100% (queued) |
| vLLM | ~520 tok/s | 100% |
vLLM begins to pull ahead. Ollama and llama.cpp serialize requests — higher throughput comes from shorter queue times, not parallel processing.
| Engine | Aggregate throughput | TTFR (P50) |
|--------|---------------------|------------|
| Ollama | ~160 tok/s | 3,200ms |
| vLLM | ~920 tok/s | 145ms |
Ollama effectively breaks under this load — requests succeed but latency becomes unusable. vLLM scales linearly with VRAM until memory-bound.
| Engine | Notes |
|--------|-------|
| llama.cpp | Best-in-class. Hand-tuned AVX-512 kernels. |
| Ollama | Identical to llama.cpp (same backend). |
| vLLM | Not viable. GPU-only architecture. |
| Engine | 70B model, 24GB VRAM |
|--------|---------------------|
| llama.cpp (Q4_K_M) | Fits. ~8K–16K context. |
| vLLM | OOMs. Requires 48GB+ for 70B. |
---
The gap between Ollama and vLLM at scale is not a tuning difference. It is architectural.
Ollama (llama.cpp) processes requests serially or in small static batches. Each request occupies the GPU from first token to last. A 500-token output blocks 499 other requests for the duration of that generation.
vLLM's continuous batching means a request's decode step is interleaved with other requests' prefill and decode steps at the token level. The GPU is never waiting for one slow request to finish before starting another. Every forward pass is dense.
This is the same design principle as preemptive multitasking in operating systems. Static batching is cooperative — requests yield only when done. Continuous batching is preemptive — the scheduler allocates GPU time across all in-flight requests on every token step.
You cannot replicate this behavior with configuration. It requires a different architecture.
---
Memory behavior is where the wrong choice becomes expensive.
vLLM pre-allocates KV cache on startup. It reserves VRAM for the maximum number of concurrent KV cache pages it may need. This means:
llama.cpp and Ollama allocate KV cache dynamically. They use only what the current request requires. A 70B Q4_K_M model (requiring ~40GB of weights) can be partially offloaded to system RAM via GPU layers — slower, but functional. A 13B Q5_K_M model runs fully in VRAM on a 16GB card with context left over.
The practical rule: if your hardware is memory-constrained, llama.cpp wins. If you have abundant GPU memory and serve concurrent users, vLLM wins.
---
Use Ollama. It gives you llama.cpp performance with model management and an OpenAI-compatible API. Run it as a systemd service. Done.
``
sudo systemctl enable ollama
sudo systemctl start ollama
ollama pull llama3.2
`
No configuration file. No Python environment. No GPU driver compatibility matrix to navigate.
Shared team endpoint (5–50 users)
Use vLLM. Install into a Python venv, configure via environment variables, run behind a reverse proxy.
`bash
python -m venv .venv
source .venv/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192
`
Expose through nginx with basic auth. Monitor
/metrics with Prometheus. Add a rate limiter.
Embedded tool (CLI, script, library)
Use llama.cpp directly (via bindings or subprocess). You want inference without a running daemon. The
llama-cpp-python package exposes the C library with a Python interface.
`python
from llama_cpp import Llama
llm = Llama(model_path="./model.gguf", n_gpu_layers=-1)
output = llm("The capital of France is", max_tokens=32)
``
No server, no network overhead, no port to manage.
llama.cpp. No alternatives at this level. It runs on Raspberry Pi, AWS Graviton, OCI A1 ARM instances, and Apple Silicon with native Metal acceleration.
---
Each engine has preferred quantization formats.
llama.cpp / Ollama: GGUF exclusively. The full K-quant ladder (Q2_K through Q8_0) plus F16 are all supported. This is the format for local and CPU inference.
vLLM: AWQ, GPTQ, and FP8 for GPU inference. GGUF support exists via experimental paths but is not the primary format. If you are running vLLM, you are likely downloading from Hugging Face in the native format.
Do not mix engines and formats carelessly. A Q5_K_M GGUF will not load into vLLM without conversion. An AWQ model will not load into Ollama.
---
Answer these questions in order:
1. How many concurrent users?
2. What hardware?
3. How memory-constrained are you?
4. What is your operational tolerance?
---
One clarification worth making explicit: Ollama is not a separate inference engine. It is llama.cpp. Every benchmark that shows Ollama and llama.cpp within 0–3% of each other is confirming this. They are the same compute path.
The choice between Ollama and llama.cpp is a choice between a managed service and a library. If you want a running HTTP server with model management, use Ollama. If you want to call inference from code without a daemon, use llama.cpp.
Do not run both simultaneously and expect twice the throughput. You are running the same engine twice.
---
| Criterion | llama.cpp | Ollama | vLLM |
|-----------|-----------|--------|------|
| Single-user performance | Identical | Identical | Identical |
| Multi-user throughput | Low | Low | High (8-9x) |
| Hardware breadth | Widest | Wide | GPU-only |
| Memory efficiency | Best | Best | Reservation-based |
| Operational simplicity | Low | High | Medium |
| CPU/ARM support | Yes | Yes | No |
| Format | GGUF | GGUF | AWQ/GPTQ/FP8 |
The engines are not competitors. They occupy different positions in the deployment landscape. llama.cpp is a library for maximum portability. Ollama is a developer-friendly wrapper around it. vLLM is a production serving engine for multi-user GPU deployments.
Know your workload. Pick accordingly. Then stop second-guessing.