#llm — devinfo.dev

inspiration

Latency and Throughput Are Not the Same Goal

Two systems serving the same model can feel completely different because they optimize for opposite things.

July 19, 2026

inspiration

The Tokenizer Decides What the Model Can See

A model never sees characters or words — it sees tokens. The tokenizer is a lossy, fixed decision made before training, and it quietly shapes what the model is good and bad at: arithmetic, rare words, code, and non-English text all live or die by tokenization.

July 18, 2026

inspiration

Fine-Tuning Is Usually Not the First Move

Reaching for fine-tuning to fix a model is often the expensive wrong turn. Most problems that look like they need fine-tuning are really retrieval or prompting problems. Fine-tuning changes behavior and style; it is a poor and costly way to inject knowledge.

July 17, 2026

inspiration

Retrieval Is a Ranking Problem

Most RAG systems that disappoint are not failing at generation. They are failing at retrieval — and specifically at ranking. Swapping vector databases rarely fixes it. Two-stage retrieval and honest evaluation usually do.

July 14, 2026

inspiration

The Tokenizer Is the Bug

Every LLM failure starts with the same invisible step: tokenization. It runs before inference, produces no logs, and degrades outputs silently. Most debugging sessions end at the model. They should start at the tokenizer.

June 11, 2026

whitepaper

Evals Are Not Optional

Benchmark scores are not evaluations. Contamination is widespread, Goodhart's Law is in effect, and the gap between a leaderboard number and production behaviour cannot be bridged without a real eval pipeline. This paper defines what evals are, why the major benchmarks are unreliable in isolation, and how to build an evaluation practice that actually catches failures.

June 8, 2026

inspiration

The Tool Is Not the Model

A language model does not execute functions. It describes them. The execution lives elsewhere — in your code, your runtime, your responsibility.

June 1, 2026

whitepaper

Fine-Tuning, RAG, or Prompting: An Engineering Decision

Three techniques can improve LLM output quality: prompt engineering, retrieval-augmented generation, and fine-tuning. Each solves a different problem. Choosing the wrong one wastes months and produces worse results than the right one done simply.

June 1, 2026

inspiration

Temperature Is Not Creativity

Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.

May 30, 2026

inspiration

Prompt Caching Is Free Money

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching cuts that cost by reusing precomputed KV tensors across requests. Depending on the provider, it requires few or no code changes and can deliver up to 90% input token savings.

May 28, 2026

inspiration

The Model Is Not the Agent

An LLM does not call tools. It requests them. The loop is the agent — and most broken agents are broken loops, not broken models.

May 26, 2026

inspiration

Speculative Decoding: The Free Tokens

Speculative decoding can cut inference latency by 2–3x without changing a single output token. The gain is real. So is the catch.

May 25, 2026

whitepaper

Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.

May 25, 2026

inspiration

Quantization Is a Design Decision

Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.

May 24, 2026

inspiration

Context Is Not Memory

A large context window does not make an LLM remember. It makes it attend. The distinction changes how you build.

May 23, 2026