#inference — devinfo.dev

inspiration

Continuous Batching Is Why Your Server Is Fast

Static batching wastes the GPU: every request waits for the slowest one in its batch to finish. Continuous batching — the idea behind Orca and vLLM — schedules at the token level instead of the request level, and it is the single biggest reason modern serving throughput is what it is.

July 16, 2026

inspiration

Speculative Decoding Is a Bet on the Draft

Speculative decoding makes a large model generate faster by letting a small model guess ahead. It is lossless: the output follows the same distribution as decoding from the large model alone. But the entire speedup is a function of how often the draft is right, which makes the technique only as good as the match between your draft and target models.

July 15, 2026

inspiration

Quantization Is a Memory-Bandwidth Decision

Dropping a model from FP16 to INT4 is usually framed as a way to fit it in less VRAM. That is the smaller half of the story. When you serve a single stream, token generation is bound by memory bandwidth, not arithmetic — every token reads the entire model from memory once. Quantization shrinks that read, so it buys throughput, not just capacity.

July 13, 2026

inspiration

The Adapter Is Not the Model

The obvious way to serve ten fine-tuned variants is to run ten models. That is wrong. A LoRA adapter is a thin correction on top of a base model — and the base model is the same for all of them. Merging the adapter back into the weights before serving discards the one fact that makes multi-tenant fine-tuning cheap.

July 12, 2026

inspiration

The Image Is a Token Budget

A multimodal model does not see an image. It reads a sequence of tokens — and the number of tokens an image produces is an engineering decision, not a fixed property of the image. Most engineers discover this only when their context window fills up.

July 11, 2026

inspiration

The Grammar Is the Guarantee

Asking a model to output valid JSON is a prompt. Enforcing valid JSON at every decode step is an architecture decision. Constrained decoding masks invalid tokens before sampling — making malformed output structurally impossible, not just unlikely.

July 8, 2026

inspiration

Calibration Is the Work

Round-to-nearest quantization distributes precision evenly across all weights. That is the wrong allocation. A tiny fraction of weight channels — the ones multiplied by large activations — dominate model output. AWQ finds those channels first. RTN never looks.

July 7, 2026

inspiration

The Activation Is a Gate

Most modern open-weight LLMs replaced the plain FFN activation function with SwiGLU. That swap added a third weight matrix, changed the hidden dimension arithmetic, and altered how information flows through every feed-forward layer. It is not a cosmetic improvement. It is a structural decision made once at training time that every inference system running the model must pay for.

July 5, 2026

inspiration

The Draft Model Does the Work

Speculative decoding uses a small draft model to propose tokens and a large model to verify them in parallel. The large model runs once per batch of drafted tokens, not once per token. That single change converts a sequential bottleneck into a parallel verification step, and it can deliver a 2–3x latency reduction at zero quality cost.

July 4, 2026

inspiration

Position Is a Rotation

Nearly every modern open-weight LLM encodes token position using rotation. Not addition. The shift from additive to multiplicative position encoding is not cosmetic: it changes what the model can generalize to, how far its context can be extended, and what hardware-efficient tricks remain available.

July 3, 2026

inspiration

Thinking Tokens Are Compute

Test-time compute is a second axis of scaling, independent of model size. When a model generates reasoning tokens before its answer, it is not producing output — it is running computation. The cost of those tokens is the cost of a thinking process, and how you budget them determines what the model can solve.

June 30, 2026

whitepaper

The Memory Wall: A Field Guide to LLM Inference on Consumer Hardware

On consumer hardware, LLM token generation is not compute-bound. It is memory-bandwidth-bound. Understanding that single fact, and the arithmetic that follows from it, determines every sensible hardware and quantization decision you will make when running models on consumer devices.

June 29, 2026

inspiration

The System Prompt Is Load-Bearing

The system prompt is not a style guide. It occupies a privileged position in the model's instruction hierarchy — one that was trained in, not just interpreted at runtime. Moving an instruction from system to user does not just change where it appears. It changes how much the model trusts it.

June 29, 2026

inspiration

The Chat Template Is the Interface

Every model family uses a different format to structure conversations into tokens. The chat template, a Jinja2 template shipped with the model, encodes that format. Apply the wrong one and the model never sees a conversation. It sees a text blob. The degradation is silent, and the model gets the blame.

June 28, 2026

inspiration

The Scheduler Is Not the Model

You can swap in a faster model and get slower responses. The model is not the bottleneck — the scheduler is. How the serving system decides what to run, in what order, and in what chunks determines your latency and throughput.

June 27, 2026

inspiration

Attention Heads Are Not Equal

Multi-head attention gives every query head its own key and value head. That is thorough, and expensive. Grouped-Query Attention proves the redundancy: Llama 3 70B serves 64 query heads from 8 KV heads, cuts its KV cache by 8x, and loses almost nothing in quality.

June 26, 2026

inspiration

Sparsity Is Not Speed

You can remove 50% of a model's weights and make it slower. Sparsity is a mathematical property. Speed is a hardware property. Confusing them is one of the most expensive mistakes in applied model compression.

June 25, 2026

inspiration

Distillation Is Not Compression

Quantization shrinks a model. Distillation trains a new one. The distinction is not semantic — it changes your compute budget, your deployment story, and what you can actually achieve at a given size.

June 24, 2026

inspiration

PagedAttention Is an OS Idea

Before PagedAttention, LLM serving systems wasted 60–80% of their KV cache memory on fragmentation and over-reservation. The fix was not a new neural architecture. It was a 1960s operating systems concept, paging, applied at a layer it was never designed for.

June 23, 2026

inspiration

Prefill Is the Stall

The gap between submitting a prompt and receiving the first token is not network lag. It is compute. Prefill is a matrix multiplication over every token in your input — and it blocks decode entirely until it finishes.

June 21, 2026

inspiration

Sparse Is Not Small

A model with 671 billion parameters can cost less to run per token than a 70-billion-parameter dense model. That is not a marketing claim. It is arithmetic. Mixture of Experts routes each token through a small subset of its experts instead of through every weight, and that routing decision is the cost model.

June 19, 2026

inspiration

Sampling Is a Filter

Top-k, top-p, and min-p are not interchangeable dials. Each one cuts the probability distribution at a different seam — and each has a failure mode that the others do not. Knowing which filter you are applying is a prerequisite to reasoning about your outputs.

June 18, 2026

inspiration

The Router Is the System

Routing between models is not a configuration detail. It is a measurable, trainable system boundary, and treating it as one can cut inference costs by 40–85% with little or no loss in quality.

June 17, 2026

inspiration

Compression Is Not Cheating

Sending fewer tokens to the model is not a workaround. It is engineering. The context window tells you the maximum; it does not tell you the optimum.

June 16, 2026

inspiration

Long Context Is Not Long Attention

Expanding a model's context window does not guarantee it attends to all of that context. The window is a capacity claim. Attention quality across that capacity is a separate, structural problem — and it degrades in ways that are not visible in perplexity scores.

June 14, 2026

inspiration

Parallelism Is a Topology Decision

Tensor parallelism and pipeline parallelism are not interchangeable scaling knobs. They encode different assumptions about your hardware, your model shape, and what you are optimizing for. Choosing wrong does not just waste GPUs — it locks in a latency-throughput tradeoff you did not knowingly make.

June 13, 2026

inspiration

Merging Is Not Training

Model merging combines two or more fine-tuned LLMs into a single model without any gradient updates. No data. No compute budget. No training run. The result inherits capabilities from every source model — if you pick the right algorithm.

June 12, 2026

inspiration

The Tokenizer Is the Bug

Every LLM failure starts with the same invisible step: tokenization. It runs before inference, produces no logs, and degrades outputs silently. Most debugging sessions end at the model. They should start at the tokenizer.

June 11, 2026

inspiration

GGUF Is a Container, Not Just Weights

Every self-hosted AI practitioner downloads .gguf files. Few understand what they are. GGUF is not a weight dump — it is a self-contained container that carries the model, the tokenizer, the quantization scheme, and the chat template in a single file. That design decision changed how open-source models are distributed.

June 10, 2026

inspiration

Flash Attention Is an IO Problem

Standard attention is slow not because of arithmetic — it is slow because of memory traffic. Flash Attention solves the IO problem, not the compute problem. That distinction matters for how you think about every inference optimization that follows.

June 9, 2026

inspiration

Prefix Caching Is Free Throughput

Automatic Prefix Caching in vLLM reuses already-computed KV cache blocks across requests that share identical prefixes — delivering 30–50% throughput gains and up to 10x latency reduction at zero engineering cost beyond a single configuration flag.

June 8, 2026

inspiration

Continuous Batching: The Throughput Multiplier

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this at the scheduler level — and the gains are not marginal.

June 7, 2026

inspiration

LoRA Is Not Fine-Tuning

LoRA does not update your model. It adds a thin, low-rank correction on top — and that distinction changes how you think about deployment, switching, and scale.

June 5, 2026

inspiration

Steering Is Not Prompting

Prompts influence what a model says. Activation steering changes what the model is, mid-inference. They are not the same tool.

June 4, 2026

inspiration

The KV Cache Is Your Real Memory Budget

The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.

June 3, 2026

inspiration

Attention Sinks: The Tokens That Hold Everything Together

Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.

June 2, 2026

inspiration

Temperature Is Not Creativity

Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.

May 30, 2026

inspiration

Prompt Caching Is Free Money

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. Where the provider applies it automatically it requires no code changes, and it can cut input token costs by up to 90%.

May 28, 2026

booklet

The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in

A single OpenAI-compatible endpoint that routes across Ollama, llama.cpp, and FreeLLMAPI with automatic failover. This booklet documents the architecture, routing logic, and deployment of the localllm-engine.

May 27, 2026

inspiration

Structured Outputs Are a Contract

Constrained generation is not a convenience feature. It is a systems boundary — a contract between your model and every downstream component that consumes its output.

May 27, 2026

booklet

Ollama Beyond Defaults: Custom Model Paths on Windows and WSL

Ollama assumes default paths. When your models live elsewhere, the documentation stops helping. This booklet covers every configuration path for Windows native, WSL2, and cross-boundary access.

May 27, 2026

booklet

From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances

Free tier cloud compute promises self-hosted AI. The reality is capacity lotteries, region lock-in, and silent deprecation. This booklet documents what actually works, what does not, and how to build an inference setup that survives policy changes.

May 27, 2026

inspiration

Speculative Decoding: The Free Tokens

Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.

May 25, 2026

whitepaper

Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.

May 25, 2026

inspiration

Quantization Is a Design Decision

Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.

May 24, 2026

inspiration

Context Is Not Memory

A large context window does not make an LLM remember. It makes it attend. The distinction changes how you build.

May 23, 2026