Papers

inspiration

Latency and Throughput Are Not the Same Goal

Two systems serving the same model can feel completely different because they optimize opposite things.

July 19, 2026

inspiration

The Tokenizer Decides What the Model Can See

A model never sees characters or words — it sees tokens. The tokenizer is a lossy, fixed decision made before training, and it quietly shapes what the model is good and bad at: arithmetic, rare words, code, and non-English text all live or die by tokenization.

July 18, 2026

inspiration

Fine-Tuning Is Usually Not the First Move

Reaching for fine-tuning to fix a model is often the expensive wrong turn. Most problems that look like they need fine-tuning are really retrieval or prompting problems. Fine-tuning changes behavior and style; it is a poor and costly way to inject knowledge.

July 17, 2026

inspiration

Continuous Batching Is Why Your Server Is Fast

Static batching wastes the GPU: every request waits for the slowest one in its batch to finish. Continuous batching — the idea behind Orca and vLLM — schedules at the token level instead of the request level, and it is the single biggest reason modern serving throughput is what it is.

July 16, 2026

inspiration

Speculative Decoding Is a Bet on the Draft

Speculative decoding makes a large model generate faster by letting a small model guess ahead. It is lossless — the output is identical to decoding from the large model alone. But the entire speedup is a function of how often the draft is right, which makes the technique only as good as the match between your draft and target models.

July 15, 2026

inspiration

Retrieval Is a Ranking Problem

Most RAG systems that disappoint are not failing at generation. They are failing at retrieval — and specifically at ranking. Swapping vector databases rarely fixes it. Two-stage retrieval and honest evaluation usually do.

July 14, 2026

inspiration

Quantization Is a Memory-Bandwidth Decision

Dropping a model from FP16 to INT4 is usually framed as a way to fit it in less VRAM. That is the smaller half of the story. When you serve a single stream, token generation is bound by memory bandwidth, not arithmetic — every token reads the entire model from memory once. Quantization shrinks that read, so it buys throughput, not just capacity.

July 13, 2026

inspiration

The Adapter Is Not the Model

The obvious way to serve ten fine-tuned variants is to run ten models. That is wrong. A LoRA adapter is a thin correction on top of a base model — and the base model is the same for all of them. Merging the adapter back into the weights before serving discards the one fact that makes multi-tenant fine-tuning cheap.

July 12, 2026

inspiration

The Image Is a Token Budget

A multimodal model does not see an image. It reads a sequence of tokens — and the number of tokens an image produces is an engineering decision, not a fixed property of the image. Most engineers discover this only when their context window fills up.

July 11, 2026

inspiration

The Embedding Is Not the Default

Every RAG system encodes text into vectors. The model that produces those vectors — and the dimensionality you accept from it — is an engineering decision. Most engineers make it once, at setup, and never revisit it. That is the wrong posture.

July 10, 2026

inspiration

Perplexity Is Not a Proxy

A model can assign low perplexity to a sequence it gets wrong. This is not an edge case; it follows from what the metric measures. Perplexity measures how surprised a model is by token sequences. It does not measure whether the model is right.

July 9, 2026

inspiration

The Grammar Is the Guarantee

Asking a model to output valid JSON is a prompt. Enforcing valid JSON at every decode step is an architecture decision. Constrained decoding masks invalid tokens before sampling — making malformed output structurally impossible, not just unlikely.

July 8, 2026

inspiration

Calibration Is the Work

Round-to-nearest quantization distributes precision evenly across all weights. That is the wrong allocation. A tiny fraction of weight channels — the ones multiplied by large activations — dominate model output. AWQ finds those channels first. RTN never looks.

July 7, 2026

inspiration

The Protocol Is the Integration

Before MCP, every AI application that needed external tools built its own adapter. MCP replaces M×N custom integrations with a single standard — and the standard is not an API, it is a protocol. That distinction determines what you can build and how it composes.

July 6, 2026

inspiration

The Activation Is a Gate

Nearly every modern open-weight LLM has replaced its plain FFN activation function with a gated one, most often SwiGLU. That swap added a third weight matrix, changed the hidden dimension arithmetic, and altered how information flows through every feed-forward layer. It is not a cosmetic improvement. It is a structural decision made once at training time that every inference system running the model must pay for.

July 5, 2026

inspiration

The Draft Model Does the Work

Speculative decoding uses a small draft model to propose tokens and a large model to verify them in parallel. The large model runs once per block of drafted tokens, not once per token. That single change converts a sequential bottleneck into a parallel verification step, and it can deliver a 2–3x latency reduction at zero quality cost when the draft is accurate.

July 4, 2026

inspiration

Position Is a Rotation

Every modern open-weight LLM encodes token position using rotation. Not addition. The shift from additive to multiplicative position encoding is not cosmetic — it changes what the model can generalize to, how far its context can be extended, and what hardware-efficient tricks remain available.

July 3, 2026

inspiration

The Schema Is the Spec

JSON was designed for machine parsing, not language model interpretation. The tool schema you write is not a formality — it is the performance specification for function calling. A badly designed schema does not produce an error. It produces silent accuracy degradation, and the model takes the blame.

July 2, 2026

inspiration

Position Is Not Neutral

A model that fits 128K tokens can still fail to use information you placed at token 60K. The context window is a capacity claim. Where you put information inside that window is a separate engineering decision — one with a measurable performance cost if you get it wrong.

July 1, 2026

inspiration

Thinking Tokens Are Compute

Test-time compute is a second axis of scaling, independent of model size. When a model generates reasoning tokens before its answer, it is not producing output — it is running computation. The cost of those tokens is the cost of a thinking process, and how you budget them determines what the model can solve.

June 30, 2026

inspiration

The System Prompt Is Load-Bearing

The system prompt is not a style guide. It occupies a privileged position in the model's instruction hierarchy — one that was trained in, not just interpreted at runtime. Moving an instruction from system to user does not just change where it appears. It changes how much the model trusts it.

June 29, 2026

inspiration

The Chat Template Is the Interface

Every model family uses a different format to structure conversations into tokens. The chat template, a Jinja2 program shipped with the model's tokenizer, encodes that format. Apply the wrong one and the model never sees a conversation. It sees a text blob. The degradation is silent, and the model gets the blame.

June 28, 2026

inspiration

The Scheduler Is Not the Model

You can swap in a faster model and get slower responses. The model is not the bottleneck — the scheduler is. How the serving system decides what to run, in what order, and in what chunks determines your latency and throughput.

June 27, 2026

inspiration

Attention Heads Are Not Equal

Multi-head attention gives every query head its own key and value head. That is thorough, and expensive. Grouped-Query Attention exposes the redundancy: Llama 3 70B serves 64 query heads from 8 KV heads, cuts its KV cache by 8x, and loses almost nothing in quality.

June 26, 2026

inspiration

Sparsity Is Not Speed

You can remove 50% of a model's weights and make it slower. Sparsity is a mathematical property. Speed is a hardware property. Confusing them is one of the most expensive mistakes in applied model compression.

June 25, 2026

inspiration

Distillation Is Not Compression

Quantization shrinks a model. Distillation trains a new one. The distinction is not semantic — it changes your compute budget, your deployment story, and what you can actually achieve at a given size.

June 24, 2026

inspiration

PagedAttention Is an OS Idea

Before PagedAttention, LLM serving systems wasted 60–80% of their KV cache memory on fragmentation and over-reservation. The fix was not a new neural architecture. It was paging, a 1960s operating systems concept, applied to a new layer.

June 23, 2026

inspiration

gather() Is Not Structured Concurrency

asyncio.gather() runs coroutines concurrently. asyncio.TaskGroup runs them with a defined lifetime, cancellation contract, and error propagation model. They are not the same tool. The difference matters the moment one task fails.

June 22, 2026

inspiration

Prefill Is the Stall

The gap between submitting a prompt and receiving the first token is not network lag. It is compute. Prefill is a matrix multiplication over every token in your input — and it blocks decode entirely until it finishes.

June 21, 2026

inspiration

Alignment Is an Engineering Decision

RLHF, DPO, and GRPO are not synonyms for 'making the model safe.' They are distinct training algorithms with different memory costs, stability profiles, and data requirements. Picking the wrong one does not just waste compute — it produces a model optimized for the wrong objective.

June 20, 2026

inspiration

Sparse Is Not Small

A model with 671 billion parameters can cost less to run than a 70 billion dense model. That is not a marketing claim; it is arithmetic. Mixture of Experts activates only a few experts per token, so compute scales with the active parameters rather than the total. The routing decision is the cost model.

June 19, 2026

inspiration

Sampling Is a Filter

Top-k, top-p, and min-p are not interchangeable dials. Each one cuts the probability distribution at a different seam — and each has a failure mode that the others do not. Knowing which filter you are applying is a prerequisite to reasoning about your outputs.

June 18, 2026

inspiration

The Router Is the System

Routing between models is not a configuration detail. It is a measurable, trainable system boundary — and treating it as one cuts inference costs by 40–85% without sacrificing quality.

June 17, 2026

inspiration

Compression Is Not Cheating

Sending fewer tokens to the model is not a workaround. It is engineering. The context window tells you the maximum; it does not tell you the optimum.

June 16, 2026

inspiration

The Prompt Is a Program

A prompt is not a suggestion — it is a specification. Prompts authored at design-time and executed against variable runtime input are software artifacts: they have bugs, require testing, demand versioning, and must be treated as code.

June 15, 2026

inspiration

Long Context Is Not Long Attention

Expanding a model's context window does not guarantee it attends to all of that context. The window is a capacity claim. Attention quality across that capacity is a separate, structural problem — and it degrades in ways that are not visible in perplexity scores.

June 14, 2026

inspiration

Parallelism Is a Topology Decision

Tensor parallelism and pipeline parallelism are not interchangeable scaling knobs. They encode different assumptions about your hardware, your model shape, and what you are optimizing for. Choosing wrong does not just waste GPUs — it locks in a latency-throughput tradeoff you did not knowingly make.

June 13, 2026

inspiration

Merging Is Not Training

Model merging combines two or more fine-tuned LLMs into a single model without any gradient updates. No data. No compute budget. No training run. The result inherits capabilities from every source model — if you pick the right algorithm.

June 12, 2026

inspiration

The Tokenizer Is the Bug

Every LLM call starts with the same invisible step: tokenization. It runs before inference, produces no logs, and degrades outputs silently. Most debugging sessions end at the model. They should start at the tokenizer.

June 11, 2026

inspiration

GGUF Is a Container, Not Just Weights

Every self-hosted AI practitioner downloads .gguf files. Few understand what they are. GGUF is not a weight dump — it is a self-contained container that carries the model, the tokenizer, the quantization scheme, and the chat template in a single file. That design decision changed how open-source models are distributed.

June 10, 2026

inspiration

Flash Attention Is an IO Problem

Standard attention is slow not because of arithmetic — it is slow because of memory traffic. Flash Attention solves the IO problem, not the compute problem. That distinction matters for how you think about every inference optimization that follows.

June 9, 2026

inspiration

Prefix Caching Is Free Throughput

Automatic Prefix Caching in vLLM reuses already-computed KV cache blocks across requests that share identical prefixes — delivering 30–50% throughput gains and up to 10x latency reduction at zero engineering cost beyond a single configuration flag.

June 8, 2026

inspiration

Continuous Batching: The Throughput Multiplier

Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this at the scheduler level — and the gains are not marginal.

June 7, 2026

inspiration

LoRA Is Not Fine-Tuning

LoRA does not update your model. It adds a thin, low-rank correction on top — and that distinction changes how you think about deployment, switching, and scale.

June 5, 2026

inspiration

Steering Is Not Prompting

Prompts influence what a model says. Activation steering changes what the model is, mid-inference. They are not the same tool.

June 4, 2026

inspiration

The KV Cache Is Your Real Memory Budget

The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.

June 3, 2026

inspiration

Attention Sinks: The Tokens That Hold Everything Together

Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.

June 2, 2026

inspiration

The Tool Is Not the Model

A language model does not execute functions. It describes them. The execution lives elsewhere — in your code, your runtime, your responsibility.

June 1, 2026

inspiration

Embeddings Are Not Optional

Every RAG pipeline, semantic search index, and similarity feature runs on embeddings. The generation model gets the credit. The embedding model does the work.

May 31, 2026

inspiration

Temperature Is Not Creativity

Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.

May 30, 2026

inspiration

Retrieval Is the Weakest Link

RAG systems fail at retrieval, not generation. Engineers blame the LLM. The problem is upstream.

May 29, 2026

inspiration

Prompt Caching Is Free Money

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. It requires little or no code change, depending on the provider, and delivers up to 90% savings on cached input tokens.

May 28, 2026

inspiration

Structured Outputs Are a Contract

Constrained generation is not a convenience feature. It is a systems boundary — a contract between your model and every downstream component that consumes its output.

May 27, 2026

inspiration

The Model Is Not the Agent

An LLM does not call tools. It requests them. The loop is the agent — and most broken agents are broken loops, not broken models.

May 26, 2026

inspiration

Speculative Decoding: The Free Tokens

Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.

May 25, 2026

inspiration

Your Infrastructure Should Not Need Permission

If a vendor's policy change can delete your workload overnight, you do not have infrastructure. You have a lease.

May 24, 2026

inspiration

Quantization Is a Design Decision

Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.

May 24, 2026

inspiration

The Cost of Abstraction

Every layer you add is a layer someone else must debug.

May 23, 2026

inspiration

Context Is Not Memory

A large context window does not make an LLM remember. It makes it attend. The distinction changes how you build.

May 23, 2026

inspiration

Clarity Is Kindness

Clear writing is not a style preference. It is a form of respect.

May 20, 2026