#self-hosted — devinfo.dev

inspiration
Quantization Is a Memory-Bandwidth Decision

Dropping a model from FP16 to INT4 is usually framed as a way to fit it in less VRAM. That is the smaller half of the story. When you serve a single stream, token generation is bound by memory bandwidth, not arithmetic: generating each token reads every model weight from memory once. Quantization shrinks that read, so it buys throughput, not just capacity.
July 13, 2026
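The bandwidth argument in the teaser above can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 7B parameter count and the 1000 GB/s bandwidth figure are hypothetical, and the bound ignores KV-cache reads, quantization metadata, and compute time.

```python
def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed, assuming each token
    requires reading every weight from memory exactly once."""
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical figures: a 7B-parameter model, 1000 GB/s memory bandwidth.
fp16 = max_tokens_per_second(7, 16, 1000)
int4 = max_tokens_per_second(7, 4, 1000)

print(f"FP16: {fp16:.1f} tok/s, INT4: {int4:.1f} tok/s, ratio: {int4 / fp16:.1f}x")
```

Under these assumptions the ceiling scales inversely with bytes per weight, which is why a smaller weight format can raise throughput as well as free up VRAM.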
inspiration
The Adapter Is Not the Model

The obvious way to serve ten fine-tuned variants is to run ten models. That is wrong. A LoRA adapter is a thin correction on top of a base model — and the base model is the same for all of them. Merging the adapter back into the weights before serving discards the one fact that makes multi-tenant fine-tuning cheap.
July 12, 2026
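A minimal sketch of the idea in the teaser above, assuming the usual low-rank form y = W x + B (A x) and omitting the LoRA scaling factor for brevity. The tenant names, matrix sizes, and values are hypothetical; the point is that one base matrix is held once while each tenant contributes only a small (A, B) pair.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(base, adapter, x):
    """y = W x + B (A x): W is shared, only (A, B) differs per tenant."""
    a, b = adapter
    base_out = matvec(base, x)
    delta = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base_out, delta)]


# One shared 4x4 base weight (identity here, purely for illustration).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

# Rank-1 adapters: A is 1x4, B is 4x1, so 8 numbers per tenant versus 16 in W.
adapters = {
    "tenant_a": ([[1.0, 0.0, 0.0, 0.0]], [[0.5], [0.0], [0.0], [0.0]]),
    "tenant_b": ([[0.0, 1.0, 0.0, 0.0]], [[0.0], [2.0], [0.0], [0.0]]),
}

x = [1.0, 1.0, 1.0, 1.0]
outputs = {name: lora_forward(W, ad, x) for name, ad in adapters.items()}
print(outputs)
```

Merging would instead produce a separate full-size copy of W per tenant, which is the cost the unmerged form avoids.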
whitepaper
Synthetic Data for Fine-Tuning: The Engineering Guide

Training on AI-generated data is now the default path for open-model fine-tuning. The pattern works — but it has failure modes that are not visible in benchmark scores. This paper maps five practical methods (Self-Instruct, Evol-Instruct, Orca, phi, SPIN), the model collapse risk that applies to all of them, and the design checklist that keeps a synthetic data pipeline from degrading.
July 6, 2026
inspiration
The Chat Template Is the Interface

Every model family uses a different format to structure conversations into tokens. The chat template — a Jinja2 program stored inside the model — encodes that format. Apply the wrong one and the model never sees a conversation. It sees a text blob. The degradation is silent, and the model gets the blame.
June 28, 2026
inspiration
GGUF Is a Container, Not Just Weights

Every self-hosted AI practitioner downloads .gguf files. Few understand what they are. GGUF is not a weight dump — it is a self-contained container that carries the model, the tokenizer, the quantization scheme, and the chat template in a single file. That design decision changed how open-source models are distributed.
June 10, 2026
inspiration
Continuous Batching: The Throughput Multiplier

Static batching wastes GPU cycles while every slot in a batch waits for the slowest request to finish. Continuous batching fixes this at the scheduler level, and the gains are not marginal.
June 7, 2026
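The scheduling difference described above can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not how any particular engine schedules work: each decode step is assumed to cost the same regardless of how many slots are busy, prefill is ignored, and the request lengths are hypothetical.

```python
def static_batch_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batch_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next step."""
    queue = list(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        # Requests with one token left finish on this step and leave.
        active = [r - 1 for r in active if r > 1]
    return steps


# Hypothetical workload: two long requests mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static_steps = static_batch_steps(lengths, batch_size=4)
continuous_steps = continuous_batch_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this workload the static scheduler pays for the long request twice in sequence, while the continuous one overlaps the two long requests by admitting the second as soon as a slot frees up.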
booklet
The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in

A single OpenAI-compatible endpoint that routes across Ollama, llama.cpp, and FreeLLMAPI with automatic failover. This booklet documents the architecture, routing logic, and deployment of the localllm-engine.
May 27, 2026
booklet
OpenCode with Local Models: Pointing Your Coding Agent at Your Own Inference

OpenCode is a terminal-first AI coding agent. It expects cloud APIs by default. This booklet shows how to wire it to Ollama, vLLM, or any OpenAI-compatible local endpoint — and what breaks when you do.
May 27, 2026
booklet
Ollama Beyond Defaults: Custom Model Paths on Windows and WSL

Ollama assumes default paths. When your models live elsewhere, the documentation stops helping. This booklet covers every configuration path for Windows native, WSL2, and cross-boundary access.
May 27, 2026
booklet
From Free Tier to Sovereignty: Running Inference on Cloud ARM Instances

Free-tier cloud compute promises self-hosted AI. The reality is capacity lotteries, region lock-in, and silent deprecation. This booklet documents what actually works, what does not, and how to build an inference setup that survives policy changes.
May 27, 2026
whitepaper
Choosing Your Inference Engine: llama.cpp, Ollama, and vLLM

llama.cpp, Ollama, and vLLM are not interchangeable. They solve different problems at different scales. This paper maps the architectural differences, performance characteristics, and deployment tradeoffs to help you pick the right engine for your workload — and understand why the wrong choice costs you in ways that are hard to undo.
May 25, 2026