#inference
12 papers
-
inspiration
The KV Cache Is Your Real Memory Budget
The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.
-
inspiration
Attention Sinks: The Tokens That Hold Everything Together
Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.
-
inspiration
Temperature Is Not Creativity
Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.
-
inspiration
Prompt Caching Is Free Money
Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. With providers that cache automatically it requires no code changes, and it can cut the cost of cached input tokens by up to 90%.
-
booklet
From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances
Free tier cloud compute promises self-hosted AI. The reality is capacity lotteries, region lock-in, and silent deprecation. This booklet documents what actually works, what does not, and how to build an inference setup that survives policy changes.
-
booklet
Ollama Beyond Defaults: Custom Model Paths on Windows and WSL
Ollama assumes default paths. When your models live elsewhere, the documentation stops helping. This booklet covers every configuration path for Windows native, WSL2, and cross-boundary access.
-
inspiration
Structured Outputs Are a Contract
Constrained generation is not a convenience feature. It is a systems boundary — a contract between your model and every downstream component that consumes its output.
-
booklet
The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in
A single OpenAI-compatible endpoint that routes across Ollama, llama.cpp, and FreeLLMAPI with automatic failover. This booklet documents the architecture, routing logic, and deployment of the localllm-engine.
-
whitepaper
Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM
llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.
-
inspiration
Speculative Decoding: The Free Tokens
Speculative decoding can cut inference latency by 2–3x without changing a single output token. The gain is real. So is the catch.
-
inspiration
Quantization Is a Design Decision
Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.
-
inspiration
Context Is Not Memory
A large context window does not make an LLM remember. It makes it attend. The distinction changes how you build.