#llm
9 papers
-
whitepaper
Fine-Tuning, RAG, or Prompting: An Engineering Decision
Three techniques can improve LLM output quality: prompt engineering, retrieval-augmented generation, and fine-tuning. Each solves a different problem. Choosing the wrong one wastes months and produces worse results than the right one done simply.
-
inspiration
The Tool Is Not the Model
A language model does not execute functions. It describes them. The execution lives elsewhere — in your code, your runtime, your responsibility.
-
inspiration
Temperature Is Not Creativity
Temperature is a probability reshaper, not a creativity dial. Calling it a creativity parameter is a category error — one that leads to misconfigured systems and wasted inference budget.
-
inspiration
Prompt Caching Is Free Money
Every time your app resends the same system prompt, you pay to compute it again. Prompt caching cuts that cost by reusing precomputed KV tensors across requests. Depending on the provider, it needs little or no code change and can reduce the cost of cached input tokens by up to 90%.
-
inspiration
The Model Is Not the Agent
An LLM does not call tools. It requests them. The loop is the agent — and most broken agents are broken loops, not broken models.
-
whitepaper
Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM
llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.
-
inspiration
Speculative Decoding: The Free Tokens
Speculative decoding can cut inference latency by 2–3x without changing the target model's output distribution. The gain is real. So is the catch.
-
inspiration
Quantization Is a Design Decision
Quantization is not just compression. It is a tradeoff you are making about accuracy, speed, and memory — and it belongs in your architecture docs, not your deployment scripts.
-
inspiration
Context Is Not Memory
A large context window does not make an LLM remember. It makes it attend. The distinction changes how you build.