devinfo.dev

inspiration

The Tokenizer Decides What the Model Can See

A model never sees characters or words — it sees tokens. The tokenizer is a lossy, fixed decision made before training, and it quietly shapes what the model is good and bad at: arithmetic, rare words, code, and non-English text all live or die by tokenization.

July 18, 2026

#tokenization #llm #nlp #engineering

inspiration

Fine-Tuning Is Usually Not the First Move

Reaching for fine-tuning to fix a model is often the expensive wrong turn. Most problems that look like they need fine-tuning are really retrieval or prompting problems. Fine-tuning changes behavior and style; it is a poor and costly way to inject knowledge.

July 17, 2026

#fine-tuning #rag #llm #engineering

inspiration

Continuous Batching Is Why Your Server Is Fast

Static batching wastes the GPU: every request waits for the slowest one in its batch to finish. Continuous batching — the idea behind Orca and vLLM — schedules at the token level instead of the request level, and it is the single biggest reason modern serving throughput is what it is.

July 16, 2026

#inference #batching #llm-serving #throughput

inspiration

Speculative Decoding Is a Bet on the Draft

Speculative decoding makes a large model generate faster by letting a small model guess ahead. It is lossless — the output is identical to decoding from the large model alone. But the entire speedup is a function of how often the draft is right, which makes the technique only as good as the match between your draft and target models.

July 15, 2026

#speculative-decoding #inference #llm-serving #latency

inspiration

Retrieval Is a Ranking Problem

Most RAG systems that disappoint are not failing at generation. They are failing at retrieval — and specifically at ranking. Swapping vector databases rarely fixes it. Two-stage retrieval and honest evaluation usually do.

July 14, 2026

#rag #retrieval #search #llm

inspiration

Quantization Is a Memory-Bandwidth Decision

Dropping a model from FP16 to INT4 is usually framed as a way to fit it in less VRAM. That is the smaller half of the story. When you serve a single stream, token generation is bound by memory bandwidth, not arithmetic — every token reads the entire model from memory once. Quantization shrinks that read, so it buys throughput, not just capacity.

July 13, 2026

#quantization #inference #llm-serving #self-hosted

inspiration

The Adapter Is Not the Model

The obvious way to serve ten fine-tuned variants is to run ten models. That is wrong. A LoRA adapter is a thin correction on top of a base model — and the base model is the same for all of them. Merging the adapter back into the weights before serving discards the one fact that makes multi-tenant fine-tuning cheap.

July 12, 2026

#lora #inference #llm-serving #self-hosted

inspiration

The Image Is a Token Budget

A multimodal model does not see an image. It reads a sequence of tokens — and the number of tokens an image produces is an engineering decision, not a fixed property of the image. Most engineers discover this only when their context window fills up.

July 11, 2026

#multimodal #vision-tokens #inference #llm-engineering

inspiration

The Embedding Is Not the Default

Every RAG system encodes text into vectors. The model that produces those vectors — and the dimensionality you accept from it — is an engineering decision. Most engineers make it once, at setup, and never revisit it. That is the wrong posture.

July 10, 2026

#embeddings #rag #matryoshka #retrieval

inspiration

Perplexity Is Not a Proxy

A model can assign low perplexity to a sequence it gets wrong. This is not an edge case — it is a theorem. Perplexity measures how surprised a model is by token sequences. It does not measure whether the model is right.

July 9, 2026

#evaluation #perplexity #llm-engineering #benchmarks

inspiration

The Grammar Is the Guarantee

Asking a model to output valid JSON is a prompt. Enforcing valid JSON at every decode step is an architecture decision. Constrained decoding masks invalid tokens before sampling — making malformed output structurally impossible, not just unlikely.

July 8, 2026

#constrained-decoding #structured-generation #inference #llm-engineering

inspiration

Calibration Is the Work

Round-to-nearest quantization distributes precision evenly across all weights. That is the wrong allocation. A tiny fraction of weight channels — the ones multiplied by large activations — dominate model output. AWQ finds those channels first. RTN never looks.

July 7, 2026

#quantization #awq #inference #model-compression

whitepaper

Synthetic Data for Fine-Tuning: The Engineering Guide

Training on AI-generated data is now the default path for open-model fine-tuning. The pattern works — but it has failure modes that are not visible in benchmark scores. This paper maps five practical methods (Self-Instruct, Evol-Instruct, Orca, phi, SPIN), the model collapse risk that applies to all of them, and the design checklist that keeps a synthetic data pipeline from degrading.

July 6, 2026

#fine-tuning #synthetic-data #llm-engineering #self-hosted

inspiration

The Protocol Is the Integration

Before MCP, every AI application that needed external tools built its own adapter. MCP replaces M×N custom integrations with a single standard — and the standard is not an API, it is a protocol. That distinction determines what you can build and how it composes.

July 6, 2026

#mcp #tool-use #agents #llm-engineering

inspiration

The Activation Is a Gate

Every modern open-weight LLM replaced its FFN activation function with SwiGLU. That swap added a third weight matrix, changed the hidden dimension arithmetic, and altered how information flows through every feed-forward layer. It is not a cosmetic improvement. It is a structural decision made once at training time that every inference system running the model must pay for.

July 5, 2026

#swiglu #ffn #inference #llm-architecture

inspiration

The Draft Model Does the Work

Speculative decoding uses a small draft model to propose tokens and a large model to verify them in parallel. The large model runs once per batch, not once per token. That single change converts a sequential bottleneck into a parallel verification step — and delivers 2–3x latency reduction at zero quality cost.

July 4, 2026

#speculative-decoding #inference #latency #llm-serving

inspiration

Position Is a Rotation

Every modern open-weight LLM encodes token position using rotation. Not addition. The shift from additive to multiplicative position encoding is not cosmetic — it changes what the model can generalize to, how far its context can be extended, and what hardware-efficient tricks remain available.

July 3, 2026

#positional-encoding #rope #inference #llm-engineering

inspiration

The Schema Is the Spec

JSON was designed for machine parsing, not language model interpretation. The tool schema you write is not a formality — it is the performance specification for function calling. A badly designed schema does not produce an error. It produces silent accuracy degradation, and the model takes the blame.

July 2, 2026

#tool-use #function-calling #llm-engineering #agents

inspiration

Position Is Not Neutral

A model that fits 128K tokens can still fail to use information you placed at token 60K. The context window is a capacity claim. Where you put information inside that window is a separate engineering decision — one with a measurable performance cost if you get it wrong.

July 1, 2026

#long-context #position-bias #rag #llm-engineering

inspiration

Thinking Tokens Are Compute

Test-time compute is a second axis of scaling, independent of model size. When a model generates reasoning tokens before its answer, it is not producing output — it is running computation. The cost of those tokens is the cost of a thinking process, and how you budget them determines what the model can solve.

June 30, 2026

#test-time-compute #reasoning #inference #llm-engineering

whitepaper

The Memory Wall: A Field Guide to LLM Inference on Consumer Hardware

LLM inference is not compute-bound. It is memory-bandwidth-bound. Understanding that single fact — and the arithmetic that follows from it — determines every sensible hardware and quantization decision you will make when running models on consumer devices.

June 29, 2026

#inference #consumer-hardware #quantization #memory-bandwidth

inspiration

The System Prompt Is Load-Bearing

The system prompt is not a style guide. It occupies a privileged position in the model's instruction hierarchy — one that was trained in, not just interpreted at runtime. Moving an instruction from system to user does not just change where it appears. It changes how much the model trusts it.

June 29, 2026

#llm-engineering #prompting #instruction-hierarchy #inference

inspiration

The Chat Template Is the Interface

Every model family uses a different format to structure conversations into tokens. The chat template — a Jinja2 program stored inside the model — encodes that format. Apply the wrong one and the model never sees a conversation. It sees a text blob. The degradation is silent, and the model gets the blame.

June 28, 2026

#inference #chat-template #llm-engineering #self-hosted

inspiration

The Scheduler Is Not the Model

You can swap in a faster model and get slower responses. The model is not the bottleneck — the scheduler is. How the serving system decides what to run, in what order, and in what chunks determines your latency and throughput.

June 27, 2026

#inference #llm-serving #scheduling #chunked-prefill

inspiration

Attention Heads Are Not Equal

Multi-head attention gives every query its own key and value heads. That is thorough — and expensive. Grouped-Query Attention proves the redundancy: Llama 3 70B serves 64 query heads from 8 KV heads, cuts its KV cache by 8x, and loses almost nothing in quality.

June 26, 2026

#inference #attention #gqa #kv-cache

inspiration

Sparsity Is Not Speed

You can remove 50% of a model's weights and make it slower. Sparsity is a mathematical property. Speed is a hardware property. Confusing them is one of the most expensive mistakes in applied model compression.

June 25, 2026

#pruning #inference #model-compression #hardware

inspiration

Distillation Is Not Compression

Quantization shrinks a model. Distillation trains a new one. The distinction is not semantic — it changes your compute budget, your deployment story, and what you can actually achieve at a given size.

June 24, 2026

#distillation #model-compression #llm-engineering #inference

inspiration

PagedAttention Is an OS Idea

Before PagedAttention, LLM serving systems wasted 60–80% of GPU memory on KV cache fragmentation. The fix was not a new neural architecture — it was a 1960s operating systems concept applied to the wrong layer.

June 23, 2026

#inference #pagedattention #vllm #memory-management

whitepaper

RAG Is a Retrieval Problem: Chunking, Indexing, and Why Engineers Get It Backwards

Most RAG failures happen before the LLM sees a single token. Chunking and indexing are not preprocessing steps — they are architectural decisions that determine what the model can possibly know. This paper maps the engineering decisions that actually matter: chunk strategy, index choice, hybrid retrieval, and the failure modes that remain invisible until production.

June 22, 2026

#rag #retrieval #chunking #indexing

inspiration

gather() Is Not Structured Concurrency

asyncio.gather() runs coroutines concurrently. asyncio.TaskGroup runs them with a defined lifetime, cancellation contract, and error propagation model. They are not the same tool. The difference matters the moment one task fails.

June 22, 2026

#async #python #llm-engineering #concurrency

inspiration

Prefill Is the Stall

The gap between submitting a prompt and receiving the first token is not network lag. It is compute. Prefill is a matrix multiplication over every token in your input — and it blocks decode entirely until it finishes.

June 21, 2026

#inference #prefill #ttft #llm-serving

inspiration

Alignment Is an Engineering Decision

RLHF, DPO, and GRPO are not synonyms for 'making the model safe.' They are distinct training algorithms with different memory costs, stability profiles, and data requirements. Picking the wrong one does not just waste compute — it produces a model optimized for the wrong objective.

June 20, 2026

#alignment #rlhf #dpo #llm-training

inspiration

Sparse Is Not Small

A model with 671 billion parameters can cost less to run than a 70 billion dense model. That is not a marketing claim — it is arithmetic. Mixture of Experts replaces a full forward pass with a routing decision, and the routing decision is the cost model.

June 19, 2026

#inference #mixture-of-experts #architecture #llm-serving

inspiration

Sampling Is a Filter

Top-k, top-p, and min-p are not interchangeable dials. Each one cuts the probability distribution at a different seam — and each has a failure mode that the others do not. Knowing which filter you are applying is a prerequisite to reasoning about your outputs.

June 18, 2026

#inference #sampling #llm-engineering #decoding

inspiration

The Router Is the System

Routing between models is not a configuration detail. It is a measurable, trainable system boundary — and treating it as one cuts inference costs by 40–85% without sacrificing quality.

June 17, 2026

#inference #llm-routing #llm-serving #cost-efficiency

inspiration

Compression Is Not Cheating

Sending fewer tokens to the model is not a workaround. It is engineering. The context window tells you the maximum; it does not tell you the optimum.

June 16, 2026

#inference #context-compression #llm-engineering #optimization

inspiration

The Prompt Is a Program

A prompt is not a suggestion — it is a specification. Prompts authored at design-time and executed against variable runtime input are software artifacts: they have bugs, require testing, demand versioning, and must be treated as code.

June 15, 2026

#prompting #prompt-engineering #llm-engineering #craft

whitepaper

The While Loop Is the Easy Part: Engineering Agents for Production

Every LLM agent converges on the same structure: call the model, execute tools, repeat. That loop is not where the engineering lives. The hard parts are termination conditions, context budget management, error classification, tool safety rails, and observability infrastructure — and most agents that fail in production fail there, not in the model.

June 15, 2026

#agents #llm-serving #engineering #observability

inspiration

Long Context Is Not Long Attention

Expanding a model's context window does not guarantee it attends to all of that context. The window is a capacity claim. Attention quality across that capacity is a separate, structural problem — and it degrades in ways that are not visible in perplexity scores.

June 14, 2026

#long-context #attention #inference #llm-engineering

inspiration

Parallelism Is a Topology Decision

Tensor parallelism and pipeline parallelism are not interchangeable scaling knobs. They encode different assumptions about your hardware, your model shape, and what you are optimizing for. Choosing wrong does not just waste GPUs — it locks in a latency-throughput tradeoff you did not knowingly make.

June 13, 2026

#inference #parallelism #distributed-systems #llm-serving

inspiration

Merging Is Not Training

Model merging combines two or more fine-tuned LLMs into a single model without any gradient updates. No data. No compute budget. No training run. The result inherits capabilities from every source model — if you pick the right algorithm.

June 12, 2026

#model-merging #mergekit #fine-tuning #inference

inspiration

The Tokenizer Is the Bug

Every LLM failure starts with the same invisible step: tokenization. It runs before inference, produces no logs, and degrades outputs silently. Most debugging sessions end at the model. They should start at the tokenizer.

June 11, 2026

#tokenization #llm #inference #engineering

inspiration

GGUF Is a Container, Not Just Weights

Every self-hosted AI practitioner downloads .gguf files. Few understand what they are. GGUF is not a weight dump — it is a self-contained container that carries the model, the tokenizer, the quantization scheme, and the chat template in a single file. That design decision changed how open-source models are distributed.

June 10, 2026

#gguf #inference #llama-cpp #self-hosted

inspiration

Flash Attention Is an IO Problem

Standard attention is slow not because of arithmetic — it is slow because of memory traffic. Flash Attention solves the IO problem, not the compute problem. That distinction matters for how you think about every inference optimization that follows.

June 9, 2026

#inference #attention #transformers #gpu

whitepaper

Evals Are Not Optional

Benchmark scores are not evaluations. Contamination is widespread, Goodhart's Law is in effect, and the gap between a leaderboard number and production behaviour is unbridged without a real eval pipeline. This paper defines what evals are, why the major benchmarks are unreliable in isolation, and how to build an evaluation practice that actually catches failures.

June 8, 2026

#evals #benchmarks #llm #engineering

inspiration

Prefix Caching Is Free Throughput

Automatic Prefix Caching in vLLM reuses already-computed KV cache blocks across requests that share identical prefixes — delivering 30–50% throughput gains and up to 10x latency reduction at zero engineering cost beyond a single configuration flag.

June 8, 2026

#inference #vllm #performance #prefix-caching

inspiration

Continuous Batching: The Throughput Multiplier

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this at the scheduler level — and the gains are not marginal.

June 7, 2026

#inference #vllm #throughput #self-hosted

inspiration

LoRA Is Not Fine-Tuning

LoRA does not update your model. It adds a thin, low-rank correction on top — and that distinction changes how you think about deployment, switching, and scale.

June 5, 2026

#lora #fine-tuning #inference #model-adaptation

inspiration

Steering Is Not Prompting

Prompts influence what a model says. Activation steering changes what the model is, mid-inference. They are not the same tool.

June 4, 2026

#inference #mechanistic-interpretability #activation-steering #llm-internals

inspiration

The KV Cache Is Your Real Memory Budget

The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.

June 3, 2026

#inference #kv-cache #memory #llm-serving

inspiration

Attention Sinks: The Tokens That Hold Everything Together

Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.

June 2, 2026

#inference #transformers #kv-cache #attention

inspiration

The Tool Is Not the Model

A language model does not execute functions. It describes them. The execution lives elsewhere — in your code, your runtime, your responsibility.

June 1, 2026

#tool-use #function-calling #agents #llm

whitepaper

Fine-Tuning, RAG, or Prompting: An Engineering Decision

Three techniques can improve LLM output quality: prompt engineering, retrieval-augmented generation, and fine-tuning. Each solves a different problem. Choosing the wrong one wastes months and produces worse results than the right one done simply.

June 1, 2026

#fine-tuning #rag #prompt-engineering #llm

inspiration

Embeddings Are Not Optional

Every RAG pipeline, semantic search index, and similarity feature runs on embeddings. The generation model gets the credit. The embedding model does the work.

May 31, 2026

#embeddings #rag #local-inference #vector-search

inspiration

Temperature Is Not Creativity

Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.

May 30, 2026

#inference #sampling #llm #engineering

inspiration

Retrieval Is the Weakest Link

RAG systems fail at retrieval, not generation. Engineers blame the LLM. The problem is upstream.

May 29, 2026

#rag #retrieval #embeddings #ai-engineering

inspiration

Prompt Caching Is Free Money

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. It requires no code changes and delivers up to 90% input token savings.

May 28, 2026

#inference #optimization #cost #llm

booklet

The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in

A single OpenAI-compatible endpoint that routes across Ollama, llama.cpp, and FreeLLMAPI with automatic failover. This booklet documents the architecture, routing logic, and deployment of the localllm-engine.

May 27, 2026

#localllm-engine #inference #routing #self-hosted #architecture

inspiration

Structured Outputs Are a Contract

Constrained generation is not a convenience feature. It is a systems boundary — a contract between your model and every downstream component that consumes its output.

May 27, 2026

#structured-outputs #constrained-decoding #inference #llm-engineering

booklet

OpenCode with Local Models: Pointing Your Coding Agent at Your Own Inference

OpenCode is a terminal-first AI coding agent. It expects cloud APIs by default. This booklet shows how to wire it to Ollama, vLLM, or any OpenAI-compatible local endpoint — and what breaks when you do.

May 27, 2026

#opencode #ollama #coding-agent #local-inference #self-hosted

booklet

Ollama Beyond Defaults: Custom Model Paths on Windows and WSL

Ollama assumes default paths. When your models live elsewhere, the documentation stops helping. This booklet covers every configuration path for Windows native, WSL2, and cross-boundary access.

May 27, 2026

#ollama #windows #wsl #self-hosted #inference

booklet

From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances

Free tier cloud compute promises self-hosted AI. The reality is capacity lotteries, region lock-in, and silent deprecation. This booklet documents what actually works, what does not, and how to build an inference setup that survives policy changes.

May 27, 2026

#cloud #arm #oci #sovereignty #inference #self-hosted

inspiration

The Model Is Not the Agent

An LLM does not call tools. It requests them. The loop is the agent — and most broken agents are broken loops, not broken models.

May 26, 2026

#tool-use #agents #llm #engineering

inspiration

Speculative Decoding: The Free Tokens

Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.

May 25, 2026

#inference #llm #optimization #latency

whitepaper

Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.

May 25, 2026

#inference #llm #vllm #llama.cpp #ollama #self-hosted

inspiration

Your Infrastructure Should Not Need Permission

If a vendor's policy change can delete your workload overnight, you do not have infrastructure. You have a lease.

May 24, 2026

#sovereignty #cloud #infrastructure

inspiration

Quantization Is a Design Decision

Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.

May 24, 2026

#quantization #inference #llm #systems

inspiration