The LocalLLM Engine Stack: One API, Multiple Backends, Zero Lock-in

The Problem

You have multiple inference backends. Ollama runs locally with your everyday models. A llama.cpp server handles a specialized model. FreeLLMAPI provides cloud fallback when local hardware is busy or down.

Every tool you use — OpenCode, VS Code extensions, custom scripts — needs to know which endpoint to hit. Change a backend, change every config. Add a backend, update everything.

This is the wrong architecture. The tools should not know about your backends. They should know one endpoint.

The Solution

localllm-engine is a Node.js router that presents a single OpenAI-compatible API at http://localhost:3001/v1. Behind it, requests route to the best available backend based on priority, health, and policy.


Your Tools (OpenCode, VS Code, scripts)
          |
          v
  localllm-engine (:3001/v1)
          |
          +---> Ollama         (priority 1, local)
          +---> llama.cpp      (priority 2, local)
          +---> FreeLLMAPI     (priority 3, cloud fallback)

Every tool points at localhost:3001. The engine decides where the request actually goes.


Architecture

Provider Interface

Each backend implements a common interface:

health() — is this backend reachable? Cached for 30 seconds.

models() — what models does this backend serve?

chat(request) — forward a chat completion request, return streaming or non-streaming response.

embeddings(request) — forward an embeddings request.


Three providers ship with the engine: Ollama, llama.cpp, and FreeLLMAPI. Adding a new provider means implementing this interface.
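
As a rough illustration of the interface described above, the four methods might be typed as follows. This is a minimal sketch, not the project's actual code: the `Provider` and `ChatRequest` shapes, the `isLocal` flag, and the `cachedHealth` helper are hypothetical names, and only the 30-second cache window comes from the text.

```typescript
// Hypothetical request shape, reduced to the fields this sketch needs.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  stream?: boolean;
}

// Hypothetical provider contract mirroring the four methods listed above.
interface Provider {
  readonly name: string;
  readonly isLocal: boolean; // assumption: lets a router tell local from cloud
  health(): Promise<boolean>;
  models(): Promise<string[]>;
  chat(request: ChatRequest): Promise<unknown>;
  embeddings(request: { model: string; input: string | string[] }): Promise<unknown>;
}

// Wraps a raw health probe so calls made within ttlMs reuse the last result.
function cachedHealth(
  probe: () => Promise<boolean>,
  ttlMs = 30_000,
  now: () => number = Date.now,
): () => Promise<boolean> {
  let lastChecked = -Infinity;
  let lastResult = false;
  return async () => {
    if (now() - lastChecked < ttlMs) return lastResult;
    try {
      lastResult = await probe();
    } catch {
      lastResult = false; // a probe that throws counts as unhealthy
    }
    lastChecked = now();
    return lastResult;
  };
}
```

An adapter could build its `health()` method by passing its own reachability check to `cachedHealth`. The injectable `now` parameter exists only to make the cache window easy to test.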

Routing Logic

When a request arrives at /v1/chat/completions:

1. Explicit routing: If the model name has a provider prefix (ollama:llama3.1, freellmapi:gpt-4o), route directly to that provider.

2. Policy routing: Check the X-Routing-Privacy header. If it is local_only, exclude cloud providers.

3. Priority routing: Walk the remaining provider list in order (Ollama, llama.cpp, FreeLLMAPI). The first healthy provider wins.

4. Fallback: If every eligible provider is down, return an HTTP 503 with the failure reason.

Every routing decision is logged for the /v1/engine/stats endpoint.
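
The four steps can be sketched as a single pure function. This is an illustrative sketch under stated assumptions rather than the engine's router: `ProviderState`, `RouteResult`, and `route` are hypothetical names, health is passed in as a precomputed boolean, and returning 503 when an explicitly requested provider is unhealthy is a guess, since the text does not say what happens in that case. The sketch also treats `any` the same as `local_preferred` because it has no latency data to pick the fastest backend.

```typescript
type Privacy = "local_only" | "local_preferred" | "any";

// Hypothetical snapshot of one backend at routing time.
interface ProviderState {
  name: string;
  isLocal: boolean;
  healthy: boolean; // assumed to come from a cached health check
}

type RouteResult =
  | { ok: true; provider: string; reason: string }
  | { ok: false; status: 503; reason: string };

// providers must already be sorted by priority (index 0 is tried first).
function route(
  providers: ProviderState[],
  explicit: string | null, // provider prefix taken from the model name, if any
  privacy: Privacy = "local_preferred",
): RouteResult {
  // Step 1: an explicit prefix bypasses priority order.
  if (explicit !== null) {
    const target = providers.find((p) => p.name === explicit);
    if (target && target.healthy) {
      return { ok: true, provider: target.name, reason: "explicit" };
    }
    return { ok: false, status: 503, reason: `provider "${explicit}" is unknown or unhealthy` };
  }
  // Step 2: the privacy policy removes cloud providers for local_only.
  const allowed = privacy === "local_only" ? providers.filter((p) => p.isLocal) : providers;
  // Step 3: the first healthy provider in priority order wins.
  const winner = allowed.find((p) => p.healthy);
  if (winner) return { ok: true, provider: winner.name, reason: "priority" };
  // Step 4: nothing eligible is healthy.
  return { ok: false, status: 503, reason: `no healthy provider for privacy=${privacy}` };
}
```

Keeping the decision free of I/O means it can be unit-tested with plain arrays, and the returned `reason` string is the kind of value a stats endpoint could log.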


Streaming

The engine supports true Server-Sent Events (SSE) streaming. When a client requests stream: true:


1. The engine opens an upstream SSE connection to the chosen provider.
2. Each token is forwarded to the client as it arrives.
3. Nothing is buffered: tokens pass through one at a time.

This matters for coding agents. A buffered response that arrives all at once after 30 seconds feels broken. Token streaming feels responsive.
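
The passthrough idea can be shown with a small relay loop. This is a sketch under assumptions, not the engine's streaming code: `Sink` and `relaySse` are hypothetical names, the upstream is modeled as any async iterable of byte chunks, and backpressure handling is left out for brevity.

```typescript
// Minimal writable target; an HTTP response object is assumed to fit this shape.
interface Sink {
  write(chunk: Uint8Array): unknown;
  end(): unknown;
}

// Forwards each upstream chunk as soon as it arrives. Nothing is accumulated,
// so the client sees tokens at the pace the provider produces them.
async function relaySse(upstream: AsyncIterable<Uint8Array>, sink: Sink): Promise<number> {
  let forwarded = 0;
  try {
    for await (const chunk of upstream) {
      sink.write(chunk);
      forwarded += 1;
    }
  } finally {
    sink.end(); // close the client stream even if the upstream fails midway
  }
  return forwarded;
}
```

The contrast with a buffered proxy is the absence of any array or string that collects chunks before writing. The returned count is only there so a caller can log how many chunks were relayed.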

API Surface

OpenAI-Compatible

| Method | Path | Purpose |
|--------|------|--------|
| POST | /v1/chat/completions | Chat completion (streaming + non-streaming) |
| GET | /v1/models | Aggregated model list from all providers |
| GET | /v1/models/:id | Single model details |
| POST | /v1/embeddings | Generate embeddings |

Engine Diagnostics

| Method | Path | Purpose |
|--------|------|--------|
| GET | /v1/engine/health | Provider health, latency per backend |
| GET | /v1/engine/stats | Routing decisions, request counts, fallback reasons |
| GET | /health | Simple liveness check |


Privacy Routing

The engine supports three privacy levels via the X-Routing-Privacy header:


| Value | Behavior |
|-------|----------|
| local_only | Only Ollama and llama.cpp. Error if both are down. |
| local_preferred | Try local first, fall back to FreeLLMAPI. (Default) |
| any | Use whatever is fastest/available. |

This is the sovereignty layer. Sensitive code — proprietary logic, credentials in context, internal documentation — stays on local_only. General queries can fall back to cloud.


No configuration change required per-tool. The header travels with the request.
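
From the client side, choosing a privacy level is a matter of adding one header. The helper below is a hypothetical sketch (`buildChatRequest` and its option names are not part of the engine). It assumes the default endpoint http://localhost:3001/v1 and falls back to the `auto` model string when no model is given.

```typescript
type Privacy = "local_only" | "local_preferred" | "any";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Builds the URL and fetch options for a chat completion sent through the engine.
function buildChatRequest(
  messages: ChatMessage[],
  opts: { model?: string; privacy?: Privacy; baseUrl?: string } = {},
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  const { model = "auto", privacy, baseUrl = "http://localhost:3001/v1" } = opts;
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // When the header is omitted, the engine's default privacy level applies.
  if (privacy) headers["X-Routing-Privacy"] = privacy;
  return {
    url: `${baseUrl}/chat/completions`,
    init: { method: "POST", headers, body: JSON.stringify({ model, messages }) },
  };
}

// Usage (requires a running engine):
//   const { url, init } = buildChatRequest(
//     [{ role: "user", content: "Review this internal module" }],
//     { privacy: "local_only" },
//   );
//   const response = await fetch(url, init);
```

Separating request construction from the network call keeps the sensitive-versus-general decision in one place in a script, which is where the per-request header is most useful.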

Configuration

All configuration is via environment variables:


PORT=3001
HOST=0.0.0.0
OLLAMA_BASE_URL=http://localhost:11434
LLAMACPP_BASE_URL=http://localhost:8080
FREELLMAPI_BASE_URL=http://your-freellmapi-instance
FREELLMAPI_API_KEY=your-key


If a URL is not set, that provider is disabled. The engine adapts to whatever backends are available.
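
The "unset URL disables the provider" rule could be implemented along these lines. This is a sketch with hypothetical names (`loadConfig`, `EngineConfig`, `ProviderConfig`). The variable names come from the listing above, while the fallback values for PORT and HOST are assumptions taken from the example values rather than documented defaults.

```typescript
interface ProviderConfig {
  name: string;
  baseUrl: string;
  apiKey?: string;
}

interface EngineConfig {
  port: number;
  host: string;
  providers: ProviderConfig[];
}

type Env = Record<string, string | undefined>;

// Reads engine settings from an environment-like object.
function loadConfig(env: Env): EngineConfig {
  const providers: ProviderConfig[] = [];
  const add = (name: string, urlVar: string, keyVar?: string): void => {
    const baseUrl = env[urlVar]?.trim();
    if (!baseUrl) return; // URL missing or empty: this provider stays disabled
    const apiKey = keyVar ? env[keyVar] : undefined;
    providers.push(apiKey ? { name, baseUrl, apiKey } : { name, baseUrl });
  };
  // Order here doubles as priority order in this sketch.
  add("ollama", "OLLAMA_BASE_URL");
  add("llamacpp", "LLAMACPP_BASE_URL");
  add("freellmapi", "FREELLMAPI_BASE_URL", "FREELLMAPI_API_KEY");
  return {
    port: Number(env["PORT"] ?? 3001), // assumed default
    host: env["HOST"] ?? "0.0.0.0", // assumed default
    providers,
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test with a plain object.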

Model Selection

Clients can target specific providers via model name prefixes:

| Model String | Routes To |
|-------------|----------|
| auto | Best available (local first) |
| ollama:llama3.1 | Ollama, specific model |
| ollama:codellama | Ollama, specific model |
| llamacpp:local-model | llama.cpp server |
| freellmapi:auto | Cloud fallback |
| freellmapi:gpt-4o | Cloud, specific model |


Without a prefix, the engine uses priority routing.
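
Splitting a model string into provider and model name needs some care, because Ollama model names often carry their own `:tag` suffix (for example `llama3.1:8b`). The sketch below treats the text before the first colon as a provider only when it matches a known provider name. `parseModel`, `ModelTarget`, and this matching rule are assumptions for illustration. How the engine itself resolves such names is not stated here.

```typescript
// Prefixes taken from the table above.
const KNOWN_PROVIDERS = ["ollama", "llamacpp", "freellmapi"] as const;
type ProviderName = (typeof KNOWN_PROVIDERS)[number];

interface ModelTarget {
  provider: ProviderName | null; // null means "use priority routing"
  model: string;
}

// Splits "provider:model" on the first colon, but only for known providers,
// so a tagged name such as "llama3.1:8b" is not mistaken for a prefix.
function parseModel(input: string): ModelTarget {
  const idx = input.indexOf(":");
  if (idx > 0) {
    const prefix = input.slice(0, idx);
    const known = KNOWN_PROVIDERS.find((p) => p === prefix);
    if (known) return { provider: known, model: input.slice(idx + 1) };
  }
  return { provider: null, model: input };
}
```

With this rule, `ollama:llama3.1:8b` resolves to the Ollama provider with model `llama3.1:8b`, while a bare `llama3.1:8b` or `auto` carries no provider and is left to priority routing.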

Deployment

Development


git clone https://github.com/nrupala/localllm-engine.git
cd localllm-engine
npm install
cp .env.example .env
# Edit .env with your backend URLs
npm run dev

Hot-reloads on file changes via tsx watch.


Production (systemd)


npm run build
npm start


Wrap in a systemd service for auto-restart:


[Unit]
Description=LocalLLM Engine
After=network.target ollama.service
[Service]
Type=simple
User=your-user
WorkingDirectory=/opt/localllm-engine
ExecStart=/usr/bin/node dist/index.js
Restart=always
EnvironmentFile=/opt/localllm-engine/.env
[Install]
WantedBy=multi-user.target


Standalone Binary

The engine can be packaged into a single executable (no Node.js required):


npm run bundle:linux    # Linux x64
npm run bundle:macos    # macOS x64/ARM
npm run bundle:win      # Windows x64


The resulting binary is self-contained. Copy it to any machine, set environment variables, run.

Observability

The /v1/engine/stats endpoint returns:


Total requests per provider
Average latency per provider
Fallback count and reasons
Current health status of each backend
Last 50 routing decisions with timestamps

This data feeds monitoring. If Ollama starts timing out, you see it in stats before users complain.
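
A tracker that produces these figures can be quite small. The class below is a hypothetical sketch (`RoutingStats`, `Decision`, and the snapshot field names are invented for illustration), showing per-provider totals, a fallback counter, and a bounded list holding the most recent 50 decisions. Health status is left out to keep the example short.

```typescript
interface Decision {
  timestamp: number;
  provider: string;
  latencyMs: number;
  fallbackReason?: string; // set when a higher-priority provider was skipped
}

class RoutingStats {
  private readonly recent: Decision[] = [];
  private readonly totals = new Map<string, { requests: number; latencySum: number }>();
  private readonly maxRecent: number;
  private fallbacks = 0;

  constructor(maxRecent = 50) {
    this.maxRecent = maxRecent;
  }

  // Records one routing decision and trims the history to maxRecent entries.
  record(d: Decision): void {
    const t = this.totals.get(d.provider) ?? { requests: 0, latencySum: 0 };
    t.requests += 1;
    t.latencySum += d.latencyMs;
    this.totals.set(d.provider, t);
    if (d.fallbackReason !== undefined) this.fallbacks += 1;
    this.recent.push(d);
    if (this.recent.length > this.maxRecent) this.recent.shift();
  }

  // Returns a plain object that could be serialized as a stats response.
  snapshot(): {
    providers: Record<string, { requests: number; avgLatencyMs: number }>;
    fallbackCount: number;
    recent: Decision[];
  } {
    const providers: Record<string, { requests: number; avgLatencyMs: number }> = {};
    for (const [name, t] of this.totals) {
      providers[name] = { requests: t.requests, avgLatencyMs: t.latencySum / t.requests };
    }
    return { providers, fallbackCount: this.fallbacks, recent: [...this.recent] };
  }
}
```

Storing a running sum and a count per provider keeps the average cheap to compute, and capping the history bounds memory use regardless of traffic.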

The Web UI (llm.devinfo.dev)

The engine ships with a PWA dashboard:

Chat view: Talk to any model through the engine
Models view: See all available models across all backends
Stats view: Routing decisions, latency graphs, health status
Settings view: Configure endpoint, default model, privacy level

Deployed to Cloudflare Pages at llm.devinfo.dev. The UI connects to your running engine instance — it does not host inference itself.


Why Not Just Use Ollama Directly?

Ollama is excellent for single-backend, single-user use. The engine adds value when:

1. You have multiple backends (Ollama + llama.cpp + cloud)
2. You want automatic failover without reconfiguring tools
3. You need privacy routing (some requests local-only, others cloud-ok)
4. You want a single API that aggregates models from multiple sources
5. You want observability over routing decisions

If you only run Ollama and never need failover, Ollama alone is simpler. Use the simpler tool.

Project Structure


localllm-engine/
  src/
    index.ts              # Express app, middleware, startup
    lib/
      types.ts            # OpenAI-compatible type definitions
      router.ts           # Policy-based provider routing
      observability.ts    # Request tracking and stats
    providers/
      base.ts             # Provider interface + SSE helpers
      ollama.ts           # Ollama adapter
      llamacpp.ts         # llama.cpp adapter
      freellmapi.ts       # FreeLLMAPI adapter (cloud)
    routes/
      chat.ts             # POST /v1/chat/completions
      models.ts           # GET /v1/models
      embeddings.ts       # POST /v1/embeddings
      engine.ts           # GET /v1/engine/health, /stats
    middleware/
      cors.ts             # CORS + request logging
  ui/                     # PWA dashboard (deployed to Pages)
  package.json
  tsconfig.json
  .env.example

References

localllm-engine source. https://github.com/nrupala/localllm-engine
Ollama API documentation. https://github.com/ollama/ollama/blob/main/docs/api.md
llama.cpp server documentation. https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
FreeLLMAPI. Ahmed, T. (2024). "OpenAI-compatible proxy aggregating free-tier AI providers." https://github.com/tashfeenahmed/freellmapi
OpenAI API Reference. (2024). "Chat Completions." https://platform.openai.com/docs/api-reference/chat
Server-Sent Events specification. WHATWG HTML Living Standard. https://html.spec.whatwg.org/multipage/server-sent-events.html