The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in

booklet | devinfo.dev | May 27, 2026 | devinfo.dev:2026.0013

A single OpenAI-compatible endpoint that routes across Ollama, llama.cpp, and FreeLLMAPI with automatic failover. This booklet documents the architecture, routing logic, and deployment of the localllm-engine.

The Problem

You have multiple inference backends. Ollama runs locally with your everyday models. A llama.cpp server handles a specialized model. FreeLLMAPI provides cloud fallback when local hardware is busy or down.

Every tool you use — OpenCode, VS Code extensions, custom scripts — needs to know which endpoint to hit. Change a backend, change every config. Add a backend, update everything.

This is the wrong architecture. The tools should not know about your backends. They should know one endpoint.

The Solution

localllm-engine is a Node.js router that presents a single OpenAI-compatible API at http://localhost:3001/v1. Behind it, requests route to the best available backend based on priority, health, and policy.

```
Your Tools (OpenCode, VS Code, scripts)
        |
        v
localllm-engine (:3001/v1)
        |
        +---> Ollama (priority 1, local)
        +---> llama.cpp (priority 2, local)
        +---> FreeLLMAPI (priority 3, cloud fallback)
```

Every tool points at localhost:3001. The engine decides where the request actually goes.

Architecture

Provider Interface

Each backend implements a common provider interface.

Three providers ship with the engine: Ollama, llama.cpp, and FreeLLMAPI. Adding a new provider means implementing this interface.
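
The interface itself is not reproduced in this booklet. As a rough illustration only, a provider contract could look like the sketch below. Every name here (`Provider`, `ChatRequest`, `ChatResponse`, `isHealthy`, `listModels`, `chat`, `EchoProvider`) is hypothetical; the real definitions in types.ts and providers/base.ts may differ.

```typescript
// Hypothetical shapes for illustration; not the engine's actual types.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  stream?: boolean;
}

interface ChatResponse {
  model: string;
  choices: { index: number; message: ChatMessage; finish_reason: string }[];
}

interface Provider {
  readonly name: string;      // e.g. "ollama"
  readonly isLocal: boolean;  // false for cloud backends
  isHealthy(): Promise<boolean>;
  listModels(): Promise<string[]>;
  chat(request: ChatRequest): Promise<ChatResponse>;
}

// A toy in-memory provider that echoes the last message back.
class EchoProvider implements Provider {
  readonly name = "echo";
  readonly isLocal = true;

  async isHealthy(): Promise<boolean> {
    return true;
  }

  async listModels(): Promise<string[]> {
    return ["echo-1"];
  }

  async chat(request: ChatRequest): Promise<ChatResponse> {
    const last = request.messages[request.messages.length - 1];
    return {
      model: request.model,
      choices: [
        {
          index: 0,
          message: { role: "assistant", content: last ? last.content : "" },
          finish_reason: "stop",
        },
      ],
    };
  }
}
```

A stub like `EchoProvider` is handy for exercising routing code without a running backend.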

Routing Logic

When a request arrives at /v1/chat/completions:

1. Explicit routing: If the model name has a prefix (ollama:llama3.1, freellmapi:gpt-4o), route directly to that provider.

2. Policy routing: Check the X-Routing-Privacy header. If local_only, exclude cloud providers.

3. Priority routing: Walk the provider list in order (Ollama, llama.cpp, FreeLLMAPI). First healthy provider wins.

4. Fallback: If all providers are down, return a 503 with the failure reason.

Every routing decision is logged for the /v1/engine/stats endpoint.
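
The four steps above can be sketched as a single pure function. This is a minimal illustration, not the engine's router: `selectProvider`, `ProviderInfo`, and `RouteResult` are hypothetical names, and two behaviors are assumptions the text does not specify: an explicitly requested provider that is unhealthy yields a 503, and an explicit prefix is honored before the privacy policy is checked, following the listed order.

```typescript
type Privacy = "local_only" | "local_preferred" | "any";

interface ProviderInfo {
  name: string;
  isLocal: boolean;
  healthy: boolean;
}

type RouteResult =
  | { ok: true; provider: string; reason: string }
  | { ok: false; status: 503; reason: string };

function selectProvider(
  providers: ProviderInfo[], // assumed to be sorted by priority already
  privacy: Privacy,
  explicit?: string,         // provider name taken from a model prefix, if any
): RouteResult {
  // Step 1: explicit routing goes straight to the named provider.
  if (explicit !== undefined) {
    const target = providers.find((p) => p.name === explicit);
    if (target && target.healthy) {
      return { ok: true, provider: target.name, reason: "explicit" };
    }
    // Assumption: no silent re-routing when the caller named a provider.
    return { ok: false, status: 503, reason: `provider ${explicit} unavailable` };
  }

  // Step 2: policy routing drops cloud providers for local_only requests.
  const allowed =
    privacy === "local_only" ? providers.filter((p) => p.isLocal) : providers;

  // Step 3: priority routing picks the first healthy provider.
  const winner = allowed.find((p) => p.healthy);
  if (winner) {
    return { ok: true, provider: winner.name, reason: "priority" };
  }

  // Step 4: nothing usable, so report a 503 with the reason.
  const reason =
    privacy === "local_only" ? "no healthy local provider" : "all providers down";
  return { ok: false, status: 503, reason };
}
```

Keeping the decision pure (health is passed in, not probed) makes it straightforward to unit-test and to log each result.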

Streaming

The engine supports true Server-Sent Events (SSE) streaming. When a client requests stream: true:

1. Engine opens an upstream SSE connection to the chosen provider.

2. Each token is forwarded to the client as it arrives.

3. No buffering. Token-by-token passthrough.

This matters for coding agents. A buffered response that arrives all at once after 30 seconds feels broken. Token streaming feels responsive.
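
A minimal sketch of the passthrough idea, assuming the upstream connection is exposed as an async iterable of already-framed SSE strings. `forwardStream` and `fakeUpstream` are hypothetical names, not functions from the engine. The point is that each chunk is written the moment it arrives and nothing is accumulated.

```typescript
// Forward each upstream chunk to the client as soon as it is received.
// Returns the number of chunks forwarded.
async function forwardStream(
  upstream: AsyncIterable<string>,
  write: (chunk: string) => void,
): Promise<number> {
  let forwarded = 0;
  for await (const chunk of upstream) {
    write(chunk); // no accumulation: one chunk in, one chunk out
    forwarded += 1;
  }
  return forwarded;
}

// Stand-in for a provider stream, using OpenAI-style SSE framing:
// one "data: <json>" event per token, then a final "data: [DONE]".
async function* fakeUpstream(): AsyncGenerator<string> {
  for (const token of ["Hel", "lo"]) {
    const payload = JSON.stringify({ choices: [{ delta: { content: token } }] });
    yield `data: ${payload}\n\n`;
  }
  yield "data: [DONE]\n\n";
}
```

In an Express handler, `write` would plausibly be a thin wrapper around `res.write`, with the response's `Content-Type` set to `text/event-stream` beforehand.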

API Surface

OpenAI-Compatible

| Method | Path | Purpose |
|--------|------|---------|
| POST | /v1/chat/completions | Chat completion (streaming + non-streaming) |
| GET | /v1/models | Aggregated model list from all providers |
| GET | /v1/models/:id | Single model details |
| POST | /v1/embeddings | Generate embeddings |
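
To illustrate what an aggregated /v1/models response could involve, the sketch below merges per-provider model names into the OpenAI list shape (`object: "list"` with a `data` array of `model` entries). `aggregateModels` is a hypothetical helper, and prefixing each id with its provider name is an assumption chosen to match the prefixed model strings used elsewhere in this booklet.

```typescript
interface ModelEntry {
  id: string;
  object: "model";
  owned_by: string;
}

interface ModelList {
  object: "list";
  data: ModelEntry[];
}

// Merge model names from several providers into one OpenAI-style list.
function aggregateModels(byProvider: Record<string, string[]>): ModelList {
  const data: ModelEntry[] = [];
  for (const [provider, models] of Object.entries(byProvider)) {
    for (const model of models) {
      // Assumption: ids carry the provider prefix so they stay unique.
      data.push({ id: `${provider}:${model}`, object: "model", owned_by: provider });
    }
  }
  return { object: "list", data };
}
```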

Engine Diagnostics

| Method | Path | Purpose |
|--------|------|---------|
| GET | /v1/engine/health | Provider health, latency per backend |
| GET | /v1/engine/stats | Routing decisions, request counts, fallback reasons |
| GET | /health | Simple liveness check |

Privacy Routing

The engine supports three privacy levels via the X-Routing-Privacy header:

| Value | Behavior |
|-------|----------|
| local_only | Only Ollama and llama.cpp. Error if both are down. |
| local_preferred | Try local first, fall back to FreeLLMAPI. (Default) |
| any | Use whatever is fastest/available. |

This is the sovereignty layer. Sensitive code — proprietary logic, credentials in context, internal documentation — stays on local_only. General queries can fall back to cloud.

No configuration change required per-tool. The header travels with the request.
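
On the client side, this comes down to setting one header per request. The helper below is a sketch under assumptions: `buildChatRequest` is a hypothetical name, and it simply assembles the URL and options for a chat completion aimed at the engine's default address.

```typescript
type PrivacyLevel = "local_only" | "local_preferred" | "any";

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a chat completion request that carries its own privacy level.
function buildChatRequest(
  prompt: string,
  privacy: PrivacyLevel = "local_preferred",
  baseUrl: string = "http://localhost:3001/v1",
): BuiltRequest {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Routing-Privacy": privacy,
      },
      body: JSON.stringify({
        model: "auto",
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}
```

The result can be passed to `fetch(request.url, request.init)`. A script handling sensitive code would pass `"local_only"`; everything else can rely on the default.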

Configuration

All configuration is via environment variables:

```
PORT=3001
HOST=0.0.0.0
OLLAMA_BASE_URL=http://localhost:11434
LLAMACPP_BASE_URL=http://localhost:8080
FREELLMAPI_BASE_URL=http://your-freellmapi-instance
FREELLMAPI_API_KEY=your-key
```

If a URL is not set, that provider is disabled. The engine adapts to whatever backends are available.
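
The "unset URL disables the provider" rule can be sketched as a small loader. `loadConfig`, `ProviderConfig`, and `EngineConfig` are hypothetical names, and the fallback values for `PORT` and `HOST` are assumptions taken from the example above rather than documented defaults.

```typescript
interface ProviderConfig {
  name: string;
  baseUrl: string;
  apiKey?: string;
}

interface EngineConfig {
  port: number;
  host: string;
  providers: ProviderConfig[]; // in priority order
}

// Build the engine config from environment variables.
// A provider is included only when its base URL is set and non-empty.
function loadConfig(env: Record<string, string | undefined>): EngineConfig {
  const providers: ProviderConfig[] = [];

  const ollamaUrl = env["OLLAMA_BASE_URL"];
  if (ollamaUrl) providers.push({ name: "ollama", baseUrl: ollamaUrl });

  const llamacppUrl = env["LLAMACPP_BASE_URL"];
  if (llamacppUrl) providers.push({ name: "llamacpp", baseUrl: llamacppUrl });

  const freeUrl = env["FREELLMAPI_BASE_URL"];
  if (freeUrl) {
    providers.push({
      name: "freellmapi",
      baseUrl: freeUrl,
      apiKey: env["FREELLMAPI_API_KEY"],
    });
  }

  const port = env["PORT"];
  return {
    port: port ? Number(port) : 3001, // assumed default
    host: env["HOST"] || "0.0.0.0",   // assumed default
    providers,
  };
}
```

At startup this would be called as `loadConfig(process.env)`.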

Model Selection

Clients can target specific providers via model name prefixes:

| Model String | Routes To |
|--------------|-----------|
| auto | Best available (local first) |
| ollama:llama3.1 | Ollama, specific model |
| ollama:codellama | Ollama, specific model |
| llamacpp:local-model | llama.cpp server |
| freellmapi:auto | Cloud fallback |
| freellmapi:gpt-4o | Cloud, specific model |

Without a prefix, the engine uses priority routing.
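
One detail worth handling when parsing these strings: Ollama model names can themselves contain a colon (for example `llama3.1:8b`), so a parser should split on the first colon only and treat the left side as a provider only when it is a known prefix. The sketch below shows one way to do that; `parseModel` and `ModelTarget` are hypothetical names, not the engine's own.

```typescript
const KNOWN_PREFIXES = ["ollama", "llamacpp", "freellmapi"] as const;
type ProviderName = (typeof KNOWN_PREFIXES)[number];

interface ModelTarget {
  provider?: ProviderName; // undefined means "use priority routing"
  model: string;
}

// Split "provider:model" on the first colon, but only when the left side
// is a recognized provider prefix. Anything else is a plain model name.
function parseModel(input: string): ModelTarget {
  const idx = input.indexOf(":");
  if (idx > 0) {
    const prefix = input.slice(0, idx);
    const known = KNOWN_PREFIXES.find((p) => p === prefix);
    if (known) {
      return { provider: known, model: input.slice(idx + 1) };
    }
  }
  return { model: input };
}
```

With this approach, `ollama:llama3.1:8b` targets Ollama with model `llama3.1:8b`, while a bare `llama3.1:8b` stays unprefixed and falls through to priority routing.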

Deployment

Development

```
git clone https://github.com/nrupala/localllm-engine.git
cd localllm-engine
npm install
cp .env.example .env
# Edit .env with your backend URLs
npm run dev
```

The dev server hot-reloads on file changes via tsx watch.

Production (systemd)

```
npm run build
npm start
```

Wrap in a systemd service for auto-restart:

```
[Unit]
Description=LocalLLM Engine
After=network.target ollama.service

[Service]
Type=simple
User=your-user
WorkingDirectory=/opt/localllm-engine
ExecStart=/usr/bin/node dist/index.js
Restart=always
EnvironmentFile=/opt/localllm-engine/.env

[Install]
WantedBy=multi-user.target
```

Standalone Binary

The engine can be packaged to a single executable (no Node.js required):

```
npm run bundle:linux   # Linux x64
npm run bundle:macos   # macOS x64/ARM
npm run bundle:win     # Windows x64
```

The resulting binary is self-contained. Copy it to any machine, set environment variables, run.

Observability

The /v1/engine/stats endpoint returns routing decisions, request counts, and fallback reasons.

This data feeds monitoring. If Ollama starts timing out, you see it in stats before users complain.
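
The exact response format is not reproduced here. Purely as an illustration of how such counters could be kept, the sketch below records one entry per routing decision and summarizes them on demand. `StatsRecorder`, `RoutingDecision`, `StatsSnapshot`, and all field names are hypothetical and should not be read as the engine's actual payload.

```typescript
interface RoutingDecision {
  provider: string;        // provider that served the request
  fallbackReason?: string; // set when a higher-priority provider was skipped
}

interface StatsSnapshot {
  totalRequests: number;
  requestsByProvider: Record<string, number>;
  fallbackReasons: Record<string, number>;
}

class StatsRecorder {
  private readonly decisions: RoutingDecision[] = [];

  record(decision: RoutingDecision): void {
    this.decisions.push(decision);
  }

  // Summarize everything recorded so far.
  snapshot(): StatsSnapshot {
    const requestsByProvider: Record<string, number> = {};
    const fallbackReasons: Record<string, number> = {};
    for (const d of this.decisions) {
      requestsByProvider[d.provider] = (requestsByProvider[d.provider] ?? 0) + 1;
      if (d.fallbackReason !== undefined) {
        fallbackReasons[d.fallbackReason] =
          (fallbackReasons[d.fallbackReason] ?? 0) + 1;
      }
    }
    return {
      totalRequests: this.decisions.length,
      requestsByProvider,
      fallbackReasons,
    };
  }
}
```

A rising count under a reason such as "ollama timeout" is the kind of early signal the paragraph above describes.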

The Web UI (llm.devinfo.dev)

The engine ships with a PWA dashboard, deployed to Cloudflare Pages at llm.devinfo.dev. The UI connects to your running engine instance; it does not host inference itself.

Why Not Just Use Ollama Directly?

Ollama is excellent for single-backend, single-user use. The engine adds value when:

1. You have multiple backends (Ollama + llama.cpp + cloud)

2. You want automatic failover without reconfiguring tools

3. You need privacy routing (some requests local-only, others cloud-ok)

4. You want a single API that aggregates models from multiple sources

5. You want observability over routing decisions

If you only run Ollama and never need failover, Ollama alone is simpler. Use the simpler tool.

Project Structure

```
localllm-engine/
  src/
    index.ts              # Express app, middleware, startup
    lib/
      types.ts            # OpenAI-compatible type definitions
      router.ts           # Policy-based provider routing
      observability.ts    # Request tracking and stats
    providers/
      base.ts             # Provider interface + SSE helpers
      ollama.ts           # Ollama adapter
      llamacpp.ts         # llama.cpp adapter
      freellmapi.ts       # FreeLLMAPI adapter (cloud)
    routes/
      chat.ts             # POST /v1/chat/completions
      models.ts           # GET /v1/models
      embeddings.ts       # POST /v1/embeddings
      engine.ts           # GET /v1/engine/health, /stats
    middleware/
      cors.ts             # CORS + request logging
  ui/                     # PWA dashboard (deployed to Pages)
  package.json
  tsconfig.json
  .env.example
```
