papers
-
inspiration
The KV Cache Is Your Real Memory Budget
The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.
-
inspiration
Attention Sinks: The Tokens That Hold Everything Together
Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.
-
inspiration
The Tool Is Not the Model
A language model does not execute functions. It describes them. The execution lives elsewhere — in your code, your runtime, your responsibility.
-
inspiration
Embeddings Are Not Optional
Every RAG pipeline, semantic search index, and similarity feature runs on embeddings. The generation model gets the credit. The embedding model does the work.
-
inspiration
Temperature Is Not Creativity
Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.
-
inspiration
Retrieval Is the Weakest Link
RAG systems fail at retrieval, not generation. Engineers blame the LLM. The problem is upstream.
-
inspiration
Prompt Caching Is Free Money
Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. It requires no code changes and delivers up to 90% input token savings.
-
inspiration
Structured Outputs Are a Contract
Constrained generation is not a convenience feature. It is a systems boundary — a contract between your model and every downstream component that consumes its output.
-
inspiration
The Model Is Not the Agent
An LLM does not call tools. It requests them. The loop is the agent — and most broken agents are broken loops, not broken models.
-
inspiration
Speculative Decoding: The Free Tokens
Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.
-
inspiration
Quantization Is a Design Decision
Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.
-
inspiration
Your Infrastructure Should Not Need Permission
If a vendor's policy change can delete your workload overnight, you do not have infrastructure. You have a lease.
-
inspiration
Context Is Not Memory
A large context window does not make an LLM remember. It makes it attend. The distinction changes how you build.
-
inspiration
The Cost of Abstraction
Every layer you add is a layer someone else must debug.
-
inspiration
Clarity is Kindness
Clear writing is not a style preference. It is a form of respect.