whitepaper

Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

devinfo.dev — May 25, 2026

devinfo.dev:2026.0006

#inference #llm #vllm #llama.cpp #ollama #self-hosted

Save as PDF

The Wrong Question

Most comparisons of inference engines ask: which is fastest?

That question has no useful answer, because performance depends entirely on the workload. At one concurrent user, llama.cpp, Ollama, and vLLM produce nearly identical throughput — within 3% of each other on the same hardware. At 128 concurrent users, vLLM is 8–9x faster than Ollama.

The right question is not which is fastest but fastest at what.

This paper answers that question.

---

The Three Engines, Accurately Described

llama.cpp

llama.cpp is a C/C++ inference library. It is not a server. It is the computational core that all three engines are built around — or in the case of Ollama, literally use as their backend.

Its defining characteristics:

GGUF quantization: the dominant format for quantized local models. Q4_K_M, Q5_K_M, Q6_K — all run here.
Hardware breadth: CPU, NVIDIA GPU (CUDA), AMD GPU (ROCm), Apple Silicon (Metal), hybrid CPU+GPU offload. No other engine matches this.
Long-context efficiency: fits 2x larger context windows at the same VRAM budget compared to vLLM, because it does not reserve memory for KV cache slots in advance.
No continuous batching: throughput plateaus around batch size 4 (~198 tok/s). Not designed for concurrent API serving.

llama.cpp is the right choice when you are embedding inference in a tool, running on CPU, or need maximum portability.

Ollama

Ollama is llama.cpp with a REST API wrapper, a model registry, and a developer-friendly CLI. Inference performance is identical to llama.cpp within measurement noise (0–3% variance). The difference is operational, not computational.

What Ollama adds:

ollama run, ollama pull — model management without manual GGUF downloads.
OpenAI-compatible /api/chat endpoint — drop-in for tooling that expects an OpenAI API shape.
Cross-platform packaging: macOS, Linux, Windows.

What it does not add:

Concurrency. Ollama serializes requests. Under any meaningful load (5+ concurrent users), it queues requests and latency climbs steeply.
Production observability. Metrics, health endpoints, and load balancing require external tooling.

Ollama is the right choice for developer workstations, personal inference servers, and embedded single-user tools.

vLLM

vLLM is a Python inference engine built for multi-user throughput. Its core innovation is PagedAttention — a KV cache management system borrowed from virtual memory systems in operating systems.

Traditional inference reserves a contiguous block of memory for each request's KV cache upfront, sized for the maximum sequence length. Most of that memory is unused for most of the request's lifetime. PagedAttention allocates KV cache in small, non-contiguous pages, filled on demand. This dramatically reduces memory waste and enables true continuous batching: new requests are interleaved with in-flight requests at the token level, not the request level.

The result: at 128 concurrent users, vLLM achieves 920 tok/s vs Ollama's ~160 tok/s. Time-to-first-token: 145ms vs 3,200ms. These are not marginal differences.

What vLLM costs:

GPU dependency: production-ready on NVIDIA (CUDA) and AMD (ROCm). Not yet reliable on Apple Silicon.
Memory reservation: it pre-allocates KV cache pages on startup. A 70B model may silently OOM on hardware with <48GB VRAM, where llama.cpp can fit the same model quantized to 24GB.
Complexity: Python stack, startup time, configuration surface. Not appropriate for embedding in a CLI tool.

vLLM is the right choice for shared inference APIs, multi-user self-hosted endpoints, and anything serving more than a few concurrent requests.

---

Performance: By Workload

The numbers below represent representative benchmarks on single-GPU hardware (RTX 4090 class). Your hardware will produce different absolute numbers; the ratios hold.

Single user, batch size 1

| Engine | Throughput | Notes |

|--------|-----------|-------|

| llama.cpp | ~42 tok/s | Baseline |

| Ollama | ~42 tok/s | Identical (same backend) |

| vLLM | ~45 tok/s | Within noise |

At one user, engine choice is irrelevant to performance. Pick on operational criteria.

Ten concurrent users

| Engine | Throughput | Success rate |

|--------|-----------|-------------|

| llama.cpp | ~180 tok/s | 100% (queued) |

| Ollama | ~160 tok/s | 100% (queued) |

| vLLM | ~520 tok/s | 100% |

vLLM begins to pull ahead. Ollama and llama.cpp serialize requests — higher throughput comes from shorter queue times, not parallel processing.

50–128 concurrent users

| Engine | Aggregate throughput | TTFR (P50) |

|--------|---------------------|------------|

| Ollama | ~160 tok/s | 3,200ms |

| vLLM | ~920 tok/s | 145ms |

Ollama effectively breaks under this load — requests succeed but latency becomes unusable. vLLM scales linearly with VRAM until memory-bound.

CPU-only inference

| Engine | Notes |

|--------|-------|

| llama.cpp | Best-in-class. Hand-tuned AVX-512 kernels. |

| Ollama | Identical to llama.cpp (same backend). |

| vLLM | Not viable. GPU-only architecture. |

Context window at fixed VRAM

| Engine | 70B model, 24GB VRAM |

|--------|---------------------|

| llama.cpp (Q4_K_M) | Fits. ~8K–16K context. |

| vLLM | OOMs. Requires 48GB+ for 70B. |

---

The Architectural Reason for the Performance Gap

The gap between Ollama and vLLM at scale is not a tuning difference. It is architectural.

Ollama (llama.cpp) processes requests serially or in small static batches. Each request occupies the GPU from first token to last. A 500-token output blocks 499 other requests for the duration of that generation.

vLLM's continuous batching means a request's decode step is interleaved with other requests' prefill and decode steps at the token level. The GPU is never waiting for one slow request to finish before starting another. Every forward pass is dense.

This is the same design principle as preemptive multitasking in operating systems. Static batching is cooperative — requests yield only when done. Continuous batching is preemptive — the scheduler allocates GPU time across all in-flight requests on every token step.

You cannot replicate this behavior with configuration. It requires a different architecture.

---

Memory: The Hidden Constraint

Memory behavior is where the wrong choice becomes expensive.

vLLM pre-allocates KV cache on startup. It reserves VRAM for the maximum number of concurrent KV cache pages it may need. This means:

VRAM usage is high even at idle.
Large models on constrained hardware will OOM silently — the engine starts, but fails when context grows.
On a 24GB card, a 70B model in any format will not run.

llama.cpp and Ollama allocate KV cache dynamically. They use only what the current request requires. A 70B Q4_K_M model (requiring ~40GB of weights) can be partially offloaded to system RAM via GPU layers — slower, but functional. A 13B Q5_K_M model runs fully in VRAM on a 16GB card with context left over.

The practical rule: if your hardware is memory-constrained, llama.cpp wins. If you have abundant GPU memory and serve concurrent users, vLLM wins.

---

Deployment Scenarios

Personal inference server (1–3 users)

Use Ollama. It gives you llama.cpp performance with model management and an OpenAI-compatible API. Run it as a systemd service. Done.


sudo systemctl enable ollama
sudo systemctl start ollama
ollama pull llama3.2


No configuration file. No Python environment. No GPU driver compatibility matrix to navigate.
Shared team endpoint (5–50 users)
Use vLLM. Install into a Python venv, configure via environment variables, run behind a reverse proxy.

bash
python -m venv .venv
source .venv/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192

Expose through nginx with basic auth. Monitor /metrics with Prometheus. Add a rate limiter.


Embedded tool (CLI, script, library)

Use llama.cpp directly (via bindings or subprocess). You want inference without a running daemon. The llama-cpp-python package exposes the C library with a Python interface.

python
from llama_cpp import Llama
llm = Llama(model_path="./model.gguf", n_gpu_layers=-1)
output = llm("The capital of France is", max_tokens=32)

No server, no network overhead, no port to manage.

Edge / CPU-only / ARM

llama.cpp. No alternatives at this level. It runs on Raspberry Pi, AWS Graviton, OCI A1 ARM instances, and Apple Silicon with native Metal acceleration.

---

Quantization Alignment

Each engine has preferred quantization formats.

llama.cpp / Ollama: GGUF exclusively. The full K-quant ladder (Q2_K through Q8_0) plus F16 are all supported. This is the format for local and CPU inference.

vLLM: AWQ, GPTQ, and FP8 for GPU inference. GGUF support exists via experimental paths but is not the primary format. If you are running vLLM, you are likely downloading from Hugging Face in the native format.

Do not mix engines and formats carelessly. A Q5_K_M GGUF will not load into vLLM without conversion. An AWQ model will not load into Ollama.

---

The Decision Framework

Answer these questions in order:

1. How many concurrent users?

1–4: any engine. Use Ollama for ease.
5–20: vLLM starts to matter. Test both.
20+: vLLM. Non-negotiable.

2. What hardware?

CPU only: llama.cpp or Ollama.
NVIDIA / AMD GPU with 24GB+: vLLM viable.
Apple Silicon: llama.cpp or Ollama. vLLM is not production-ready here.
ARM (OCI A1, Graviton, Raspberry Pi): llama.cpp only.

3. How memory-constrained are you?

Tight VRAM: llama.cpp. Use CPU offload if needed.
Generous VRAM, GPU-only: vLLM. Pre-allocation is an acceptable tradeoff.

4. What is your operational tolerance?

Minimal: Ollama. It is a single binary with sensible defaults.
Willing to manage a Python service: vLLM gives you the throughput.
Embedding in code: llama.cpp directly.

---

What Ollama Actually Is

One clarification worth making explicit: Ollama is not a separate inference engine. It is llama.cpp. Every benchmark that shows Ollama and llama.cpp within 0–3% of each other is confirming this. They are the same compute path.

The choice between Ollama and llama.cpp is a choice between a managed service and a library. If you want a running HTTP server with model management, use Ollama. If you want to call inference from code without a daemon, use llama.cpp.

Do not run both simultaneously and expect twice the throughput. You are running the same engine twice.

---

Summary

|-----------|-----------|--------|------|

| Multi-user throughput | Low | Low | High (8-9x) |

| CPU/ARM support | Yes | Yes | No |

The engines are not competitors. They occupy different positions in the deployment landscape. llama.cpp is a library for maximum portability. Ollama is a developer-friendly wrapper around it. vLLM is a production serving engine for multi-user GPU deployments.

Know your workload. Pick accordingly. Then stop second-guessing.

References

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., & Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." ACM SOSP 2023. https://arxiv.org/abs/2309.06180
Gerganov, G. et al. (2023-present). llama.cpp: LLM inference in C/C++. https://github.com/ggerganov/llama.cpp
Ollama. (2023-present). Get up and running with large language models. https://ollama.ai
vLLM Project. (2023-present). Easy, fast, and cheap LLM serving. https://vllm.ai
Yu, G.I., Jeong, J.S., Kim, G., Kim, S., & Chun, B. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022. (Foundational work on continuous batching.)
Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023.

Cite as

devinfo.dev. (2026). "Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM." devinfo.dev:2026.0006. https://devinfo.dev/d/2026.0006

devinfo.dev | https://devinfo.dev/d/2026.0006
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev