The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in
The Problem
You have multiple inference backends. Ollama runs locally with your everyday models. A llama.cpp server handles a specialized model. FreeLLMAPI provides cloud fallback when local hardware is busy or down.
Every tool you use — OpenCode, VS Code extensions, custom scripts — needs to know which endpoint to hit. Change a backend, change every config. Add a backend, update everything.
This is the wrong architecture. The tools should not know about your backends. They should know one endpoint.
The Solution
localllm-engine is a Node.js router that presents a single OpenAI-compatible API at http://localhost:3001/v1. Behind it, requests route to the best available backend based on priority, health, and policy.
```
Your Tools (OpenCode, VS Code, scripts)
        |
        v
localllm-engine (:3001/v1)
        |
        +---> Ollama (priority 1, local)
        +---> llama.cpp (priority 2, local)
        +---> FreeLLMAPI (priority 3, cloud fallback)
```
Every tool points at `localhost:3001`. The engine decides where the request actually goes.
Architecture
Provider Interface
Each backend implements a common interface:
- `health()`: is this backend reachable? The result is cached for 30 seconds.
- `models()`: what models does this backend serve?
- `chat(request)`: forward a chat completion request and return a streaming or non-streaming response.
- `embeddings(request)`: forward an embeddings request.
Three providers ship with the engine: Ollama, llama.cpp, and FreeLLMAPI. Adding a new provider means implementing this interface.
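As a rough picture of that interface, here is a minimal TypeScript sketch. Only the four method names and the 30-second health cache come from the description above. The type names (`Provider`, `ChatRequest`, `EmbeddingsRequest`), the return types, and the `cacheFor` helper are assumptions for illustration, not the engine's actual definitions.

```typescript
// Hypothetical request shapes; the engine's real types may differ.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  stream?: boolean;
}

interface EmbeddingsRequest {
  model: string;
  input: string | string[];
}

// The four operations every backend adapter is described as implementing.
interface Provider {
  name: string;
  health(): Promise<boolean>;
  models(): Promise<string[]>;
  chat(request: ChatRequest): Promise<unknown>;
  embeddings(request: EmbeddingsRequest): Promise<unknown>;
}

// Wrap a function so its result is reused until ttlMs has elapsed.
// The clock is injectable so the behavior can be checked without waiting.
// Note: if fn returns a Promise, a rejected promise is cached as well.
function cacheFor<T>(
  ttlMs: number,
  fn: () => T,
  now: () => number = Date.now,
): () => T {
  let cached: { value: T; at: number } | undefined;
  return () => {
    const t = now();
    if (cached === undefined || t - cached.at >= ttlMs) {
      cached = { value: fn(), at: t };
    }
    return cached.value;
  };
}
```

An adapter could then expose `health` as `cacheFor(30_000, () => probeBackend())`, where `probeBackend` is whatever reachability check that backend needs. Repeated health lookups inside the 30-second window would not hit the backend again.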
Routing Logic
When a request arrives at `/v1/chat/completions`:
1. Explicit routing: If the model name has a prefix (`ollama:llama3.1`, `freellmapi:gpt-4o`), route directly to that provider.
2. Policy routing: Check the `X-Routing-Privacy` header. If `local_only`, exclude cloud providers.
3. Priority routing: Walk the provider list in order (Ollama, llama.cpp, FreeLLMAPI). First healthy provider wins.
4. Fallback: If all providers are down, return a 503 with the failure reason.
Every routing decision is logged for the `/v1/engine/stats` endpoint.
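The four steps can be sketched as one pure function. This is an illustration, not the code in `router.ts`: the names `ProviderState`, `RouteResult`, and `route` are hypothetical, and the sketch assumes the provider prefix has already been split off the model name. It follows the listed order, so an explicit prefix is resolved before the privacy header is consulted. How the real engine handles a cloud prefix combined with `local_only` is not stated here.

```typescript
type Privacy = "local_only" | "local_preferred" | "any";

interface ProviderState {
  name: string;     // e.g. "ollama"
  local: boolean;   // false for cloud providers
  healthy: boolean; // result of the (cached) health check
}

type RouteResult =
  | { ok: true; provider: string; reason: string }
  | { ok: false; status: 503; reason: string };

function route(
  providers: ProviderState[], // assumed to be sorted by priority already
  privacy: Privacy,
  explicitProvider?: string,  // prefix taken from the model name, if any
): RouteResult {
  // 1. Explicit routing: a prefix pins the request to one provider.
  if (explicitProvider !== undefined) {
    const target = providers.find((p) => p.name === explicitProvider);
    // Assumption: an unknown or unhealthy pinned provider is reported as 503.
    if (target === undefined || !target.healthy) {
      return { ok: false, status: 503, reason: `${explicitProvider} unavailable` };
    }
    return { ok: true, provider: target.name, reason: "explicit" };
  }

  // 2. Policy routing: local_only removes cloud providers from the list.
  const eligible =
    privacy === "local_only" ? providers.filter((p) => p.local) : providers;

  // 3. Priority routing: the first healthy provider wins.
  const winner = eligible.find((p) => p.healthy);
  if (winner !== undefined) {
    return { ok: true, provider: winner.name, reason: "priority" };
  }

  // 4. Fallback: nothing eligible is healthy.
  return { ok: false, status: 503, reason: "no healthy provider" };
}
```

Keeping the decision free of I/O makes it easy to log: the returned `reason` is the kind of value a stats endpoint could record for each request.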
Streaming
The engine supports true Server-Sent Events (SSE) streaming. When a client requests `stream: true`:
1. Engine opens an upstream SSE connection to the chosen provider.
2. Each token is forwarded to the client as it arrives.
3. No buffering. Token-by-token passthrough.
This matters for coding agents. A buffered response that arrives all at once after 30 seconds feels broken. Token streaming feels responsive.
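To make the token-by-token behavior concrete, here is a sketch of how a client might decode the forwarded stream. It assumes the engine relays OpenAI-style SSE frames (`data: {json}` lines, ending with `data: [DONE]`). That follows from the OpenAI-compatible API but is not spelled out above. `SseDecoder` is a hypothetical name, and the parser is simplified: it ignores multi-line `data:` fields and other SSE field types.

```typescript
// Incrementally decodes OpenAI-style SSE text into content tokens.
class SseDecoder {
  private pending = ""; // incomplete trailing line carried between chunks
  done = false;         // set once the [DONE] sentinel has been seen

  // Feed one network chunk; returns the content deltas it completed.
  push(chunk: string): string[] {
    const tokens: string[] = [];
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last piece may be a partial line, so keep it for the next chunk.
    this.pending = lines.pop() ?? "";
    for (const raw of lines) {
      const line = raw.trimEnd(); // tolerate \r\n line endings
      if (!line.startsWith("data:")) continue; // blank lines, comments, other fields
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        this.done = true;
        continue;
      }
      const delta = JSON.parse(payload)?.choices?.[0]?.delta?.content;
      if (typeof delta === "string") tokens.push(delta);
    }
    return tokens;
  }
}
```

Because network chunks do not align with SSE frame boundaries, the decoder holds back only the unfinished line. Every completed frame is surfaced immediately, which is what keeps the UI feeling responsive.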
API Surface
OpenAI-Compatible
| Method | Path | Purpose |
|--------|------|--------|
| POST | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| GET | `/v1/models` | Aggregated model list from all providers |
| GET | `/v1/models/:id` | Single model details |
| POST | `/v1/embeddings` | Generate embeddings |
Engine Diagnostics
| Method | Path | Purpose |
|--------|------|--------|
| GET | `/v1/engine/health` | Provider health, latency per backend |
| GET | `/v1/engine/stats` | Routing decisions, request counts, fallback reasons |
| GET | `/health` | Simple liveness check |
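A client call against this surface might look like the sketch below. The request body follows the OpenAI chat completions shape. The helper name `buildChatRequest` is hypothetical, and the sketch assumes the engine needs no API key on the local endpoint, which the article does not confirm.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Builds the URL and fetch options for a chat completion request.
function buildChatRequest(
  baseUrl: string, // e.g. "http://localhost:3001/v1"
  model: string,   // e.g. "auto" or "ollama:llama3.1"
  messages: ChatMessage[],
  stream = false,
) {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream }),
    },
  };
}

// Usage with the global fetch available in Node 18+:
//   const { url, init } = buildChatRequest("http://localhost:3001/v1", "auto", [
//     { role: "user", content: "Hello" },
//   ]);
//   const res = await fetch(url, init);
//   const data = await res.json();
//   console.log(data.choices[0].message.content);
```

Any existing OpenAI client library should work the same way once its base URL is set to the engine's `/v1` endpoint.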
Privacy Routing
The engine supports three privacy levels via the `X-Routing-Privacy` header:
| Value | Behavior |
|-------|----------|
| `local_only` | Only Ollama and llama.cpp. Error if both are down. |
| `local_preferred` | Try local first, fall back to FreeLLMAPI. (Default) |
| `any` | Use whatever is fastest/available. |
This is the sovereignty layer. Sensitive code (proprietary logic, credentials in context, internal documentation) stays on `local_only`. General queries can fall back to cloud.
No configuration change required per-tool. The header travels with the request.
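One way to read the three levels is as a rule for building the candidate list before health checks run. The sketch below is an interpretation, not the engine's code. `Candidate`, `parsePrivacy`, and `candidatesFor` are hypothetical names. Treating `any` as "sort by recent latency" and mapping unrecognized header values to the default are both assumptions.

```typescript
type PrivacyLevel = "local_only" | "local_preferred" | "any";

interface Candidate {
  name: string;
  local: boolean;
  latencyMs: number; // assumed: a recent average latency per backend
}

// Missing or unrecognized header values fall back to the documented default.
function parsePrivacy(header: string | undefined): PrivacyLevel {
  return header === "local_only" || header === "any" ? header : "local_preferred";
}

// Returns providers in the order they should be tried for a given level.
// The input is assumed to be in priority order; health filtering happens later.
function candidatesFor(level: PrivacyLevel, providers: Candidate[]): Candidate[] {
  switch (level) {
    case "local_only":
      // Cloud providers are never eligible.
      return providers.filter((p) => p.local);
    case "local_preferred":
      // Local providers first, cloud providers only as a fallback.
      return [
        ...providers.filter((p) => p.local),
        ...providers.filter((p) => !p.local),
      ];
    case "any":
      // Fastest first, regardless of where the provider runs.
      return [...providers].sort((a, b) => a.latencyMs - b.latencyMs);
  }
}
```

A client opts in per request by sending the header, for example `"X-Routing-Privacy": "local_only"`, alongside an otherwise unchanged chat completion call.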
Configuration
All configuration is via environment variables:
```
PORT=3001
HOST=0.0.0.0
OLLAMA_BASE_URL=http://localhost:11434
LLAMACPP_BASE_URL=http://localhost:8080
FREELLMAPI_BASE_URL=http://your-freellmapi-instance
FREELLMAPI_API_KEY=your-key
```
If a URL is not set, that provider is disabled. The engine adapts to whatever backends are available.
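The "unset URL disables the provider" rule could be implemented along these lines. The environment variable names are the ones listed above. `ProviderConfig` and `loadProviders` are hypothetical names, and treating an empty string the same as an unset variable is an assumption.

```typescript
interface ProviderConfig {
  name: string;
  baseUrl: string;
  apiKey?: string;
}

// Builds the list of enabled providers, in priority order, from an env map.
function loadProviders(env: Record<string, string | undefined>): ProviderConfig[] {
  const specs: { name: string; urlVar: string; keyVar?: string }[] = [
    { name: "ollama", urlVar: "OLLAMA_BASE_URL" },
    { name: "llamacpp", urlVar: "LLAMACPP_BASE_URL" },
    { name: "freellmapi", urlVar: "FREELLMAPI_BASE_URL", keyVar: "FREELLMAPI_API_KEY" },
  ];

  const enabled: ProviderConfig[] = [];
  for (const spec of specs) {
    const baseUrl = env[spec.urlVar]?.trim();
    if (!baseUrl) continue; // unset (or empty) URL: provider stays disabled
    const apiKey = spec.keyVar !== undefined ? env[spec.keyVar] : undefined;
    enabled.push(
      apiKey ? { name: spec.name, baseUrl, apiKey } : { name: spec.name, baseUrl },
    );
  }
  return enabled;
}

// In the running service this would be called as loadProviders(process.env).
```

With only `OLLAMA_BASE_URL` set, the result is a single-provider list and the engine behaves like a thin proxy in front of Ollama.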
Model Selection
Clients can target specific providers via model name prefixes:
| Model String | Routes To |
|-------------|----------|
| `auto` | Best available (local first) |
| `ollama:llama3.1` | Ollama, specific model |
| `ollama:codellama` | Ollama, specific model |
| `llamacpp:local-model` | llama.cpp server |
| `freellmapi:auto` | Cloud fallback |
| `freellmapi:gpt-4o` | Cloud, specific model |
Without a prefix, the engine uses priority routing.
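Parsing the model string takes some care because Ollama model names can themselves contain a colon for the tag (for example `llama3.1:8b`). A sketch that only treats a known provider name as a prefix is shown below. `parseModel`, `ModelTarget`, and the `KNOWN_PROVIDERS` list are hypothetical names; the prefixes are the ones from the table.

```typescript
const KNOWN_PROVIDERS = ["ollama", "llamacpp", "freellmapi"] as const;
type ProviderName = (typeof KNOWN_PROVIDERS)[number];

interface ModelTarget {
  provider?: ProviderName; // undefined means "use priority routing"
  model: string;
}

// Splits "provider:model" only when the part before the first colon
// is a known provider; otherwise the whole string is the model name.
function parseModel(modelString: string): ModelTarget {
  const idx = modelString.indexOf(":");
  if (idx > 0) {
    const prefix = modelString.slice(0, idx);
    const known = KNOWN_PROVIDERS.find((p) => p === prefix);
    if (known !== undefined) {
      return { provider: known, model: modelString.slice(idx + 1) };
    }
  }
  return { model: modelString };
}
```

Under this sketch `ollama:llama3.1:8b` pins the request to Ollama with model `llama3.1:8b`, while a bare `llama3.1:8b` keeps its tag and goes through priority routing.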
Deployment
Development
```
git clone https://github.com/nrupala/localllm-engine.git
cd localllm-engine
npm install
cp .env.example .env
# Edit .env with your backend URLs
npm run dev
```
Hot-reloads on file changes via `tsx watch`.
Production (Systemd)
```
npm run build
npm start
```
Wrap in a systemd service for auto-restart:
```
[Unit]
Description=LocalLLM Engine
After=network.target ollama.service

[Service]
Type=simple
User=your-user
WorkingDirectory=/opt/localllm-engine
ExecStart=/usr/bin/node dist/index.js
Restart=always
EnvironmentFile=/opt/localllm-engine/.env

[Install]
WantedBy=multi-user.target
```
Standalone Binary
The engine can be packaged to a single executable (no Node.js required):
```
npm run bundle:linux   # Linux x64
npm run bundle:macos   # macOS x64/ARM
npm run bundle:win     # Windows x64
```
The resulting binary is self-contained. Copy it to any machine, set environment variables, run.
Observability
The `/v1/engine/stats` endpoint returns:
- Total requests per provider
- Average latency per provider
- Fallback count and reasons
- Current health status of each backend
- Last 50 routing decisions with timestamps
This data feeds monitoring. If Ollama starts timing out, you see it in stats before users complain.
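The bookkeeping behind such an endpoint can be small. The sketch below keeps per-provider totals and a bounded list of the most recent decisions. It is an illustration only: `RoutingStats`, `Decision`, and the field names in the snapshot are assumptions, not the engine's actual response schema.

```typescript
interface Decision {
  provider: string;
  latencyMs: number;
  fallbackReason?: string; // set when a higher-priority provider was skipped
  at: number;              // timestamp in milliseconds
}

class RoutingStats {
  private readonly keep: number;
  private totals = new Map<string, { count: number; latencySum: number }>();
  private recent: Decision[] = [];
  private fallbacks = 0;

  constructor(keep = 50) {
    this.keep = keep; // how many recent decisions to retain
  }

  record(d: Decision): void {
    const t = this.totals.get(d.provider) ?? { count: 0, latencySum: 0 };
    t.count += 1;
    t.latencySum += d.latencyMs;
    this.totals.set(d.provider, t);
    if (d.fallbackReason !== undefined) this.fallbacks += 1;
    // Drop the oldest entry once the window is full.
    this.recent.push(d);
    if (this.recent.length > this.keep) this.recent.shift();
  }

  snapshot() {
    const providers: Record<string, { requests: number; avgLatencyMs: number }> = {};
    this.totals.forEach((t, name) => {
      providers[name] = { requests: t.count, avgLatencyMs: t.latencySum / t.count };
    });
    return { providers, fallbackCount: this.fallbacks, recent: [...this.recent] };
  }
}
```

A rising `avgLatencyMs` for one provider, or a growing `fallbackCount`, is the kind of signal that shows a backend degrading before requests start failing outright.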
The Web UI (llm.devinfo.dev)
The engine ships with a PWA dashboard:
- Chat view: Talk to any model through the engine
- Models view: See all available models across all backends
- Stats view: Routing decisions, latency graphs, health status
- Settings view: Configure endpoint, default model, privacy level
Deployed to Cloudflare Pages at `llm.devinfo.dev`. The UI connects to your running engine instance; it does not host inference itself.
Why Not Just Use Ollama Directly?
Ollama is excellent for single-backend, single-user use. The engine adds value when:
1. You have multiple backends (Ollama + llama.cpp + cloud)
2. You want automatic failover without reconfiguring tools
3. You need privacy routing (some requests local-only, others cloud-ok)
4. You want a single API that aggregates models from multiple sources
5. You want observability over routing decisions
If you only run Ollama and never need failover, Ollama alone is simpler. Use the simpler tool.
Project Structure
```
localllm-engine/
  src/
    index.ts            # Express app, middleware, startup
    lib/
      types.ts          # OpenAI-compatible type definitions
      router.ts         # Policy-based provider routing
      observability.ts  # Request tracking and stats
    providers/
      base.ts           # Provider interface + SSE helpers
      ollama.ts         # Ollama adapter
      llamacpp.ts       # llama.cpp adapter
      freellmapi.ts     # FreeLLMAPI adapter (cloud)
    routes/
      chat.ts           # POST /v1/chat/completions
      models.ts         # GET /v1/models
      embeddings.ts     # POST /v1/embeddings
      engine.ts         # GET /v1/engine/health, /stats
    middleware/
      cors.ts           # CORS + request logging
  ui/                   # PWA dashboard (deployed to Pages)
  package.json
  tsconfig.json
  .env.example
```
References
- localllm-engine source. https://github.com/nrupala/localllm-engine
- Ollama API documentation. https://github.com/ollama/ollama/blob/main/docs/api.md
- llama.cpp server documentation. https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
- FreeLLMAPI. Ahmed, T. (2024). "OpenAI-compatible proxy aggregating free-tier AI providers." https://github.com/tashfeenahmed/freellmapi
- OpenAI API Reference. (2024). "Chat Completions." https://platform.openai.com/docs/api-reference/chat
- Server-Sent Events specification. WHATWG HTML Living Standard. https://html.spec.whatwg.org/multipage/server-sent-events.html
Cite as
devinfo.dev. (2026). "The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in." devinfo.dev:2026.0013. https://devinfo.dev/d/2026.0013
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.