booklet

OpenCode with Local Models: Pointing Your Coding Agent at Your Own Inference

devinfo.dev — May 27, 2026

devinfo.dev:2026.0011

#opencode #ollama #coding-agent #local-inference #self-hosted

Save as PDF

What OpenCode Is

OpenCode is a CLI coding agent. It reads your codebase, proposes edits, runs commands, and iterates — all from the terminal. Think of it as an agentic layer on top of an LLM that understands file systems, git, and code structure.

By default it talks to cloud APIs (Claude, GPT-4, DeepSeek). But the protocol underneath is OpenAI-compatible. Any server that speaks /v1/chat/completions with tool calling support can serve as the backend.

This means your Ollama instance, your vLLM server, or your localllm-engine router can power OpenCode. No cloud. No API keys. Full sovereignty over your prompts and code.

The Configuration Hierarchy

OpenCode reads configuration from multiple sources, in this order (later wins):

1. Remote config (cloud-managed, if connected)

2. Global config: ~/.config/opencode/opencode.json

3. Custom env var config

4. Project config: .opencode.json in your project root

5. Inline environment variables

For local models, you typically set project-level config (.opencode.json) or environment variables.

Method 1: Environment Variables (Quick Start)

The fastest path to local inference:


export LOCAL_ENDPOINT=http://localhost:11434/v1
export OPENCODE_MODEL_NAME=llama3.1:8b
opencode


This points OpenCode at Ollama running on the default port. Replace the endpoint with your server:
| Server | Endpoint |
|--------|----------|

| Ollama | http://localhost:11434/v1 |

| vLLM | http://localhost:8000/v1 |

| LM Studio | http://localhost:1234/v1 |

| localllm-engine | http://localhost:3001/v1 |

| llama-cpp-python | http://localhost:8080/v1 |


Method 2: Project Config (.opencode.json)

Create .opencode.json in your project root:


{
  "provider": {
    "local": {
      "baseURL": "http://localhost:11434/v1",
      "apiKey": "local"
    }
  },
  "agents": {
    "coder": {
      "model": "local.llama3.1:8b",
      "maxTokens": 4096
    },
    "task": {
      "model": "local.llama3.1:8b",
      "maxTokens": 4096
    }
  }
}

The model format is provider.modelname. The apiKey field is required by the schema but can be any non-empty string for local servers that do not authenticate.


Method 3: CLI Flag
Override per-session:


opencode --model ollama/qwen3:8b
opencode --model local.codellama:13b


Useful for testing different models without changing config files.
What the Model Must Support
OpenCode is not just a chat wrapper. It is an agent. It calls tools: file reads, file writes, shell commands, search. This means your local model must support the OpenAI function calling / tools API format.
Models that work:
Llama 3.1+ (8B, 70B) — native tool calling since Ollama 0.3+
Qwen 2.5 / Qwen 3 — strong tool calling support
Mistral / Mixtral — function calling supported
DeepSeek Coder V2+ — tool calling supported
Command R+ — native tool use

Models that do not work well:
Older Llama 2 models — no native tool calling
Phi-2 / Phi-3 (small variants) — unreliable function call formatting
Any model without structured output capability

If the model cannot reliably produce valid JSON tool calls, OpenCode will error or hallucinate file edits. This is the primary failure mode with local models.
Context Window: The Hidden Constraint
OpenCode sends your codebase context (file contents, directory trees, previous edits) as part of each prompt. A typical agentic turn can consume 8,000-16,000 tokens of context before the model even starts reasoning.
Default Ollama context is 2048 tokens. This is not enough.
Set a larger context window:


export OLLAMA_NUM_CTX=32768
ollama serve


Or in your Modelfile:


PARAMETER num_ctx 32768


Without this, OpenCode will silently truncate context and produce nonsensical edits because the model cannot see the relevant code.
The localllm-engine Advantage
If you run localllm-engine (the router from devinfo.dev's stack), you get:

1. Single endpoint for OpenCode: http://localhost:3001/v1


2. Automatic failover: Ollama down? Falls back to llama.cpp or cloud.

3. Privacy routing: Set X-Routing-Privacy: local_only header to ensure code never leaves your machine.

4. Model aggregation: All models from all backends appear at /v1/models.


Configure OpenCode to use the engine:


{
  "provider": {
    "engine": {
      "baseURL": "http://localhost:3001/v1",
      "apiKey": "local"
    }
  },
  "agents": {
    "coder": {
      "model": "engine.auto",
      "maxTokens": 4096
    }
  }
}

The auto model lets the engine choose the best available backend.


Performance Expectations
Local inference is slower than cloud APIs. Set expectations:
| Model | Hardware | Tokens/sec | Usable for coding? |
|-------|----------|-----------|--------------------|
| Llama 3.1 8B Q4 | RTX 4090 | ~80 tok/s | Yes — responsive |
| Llama 3.1 8B Q4 | M2 Pro | ~40 tok/s | Yes — acceptable |
| Llama 3.1 8B Q4 | CPU only (16 core) | ~8 tok/s | Barely |
| Llama 3.1 70B Q4 | RTX 4090 | ~15 tok/s | Slow but functional |
| Qwen 2.5 Coder 32B Q4 | 2x RTX 4090 | ~25 tok/s | Good |
For coding agents, you want at minimum 20 tok/s for interactive use. Below that, the agent feels broken (long pauses between edits).
Known Issues
Model not discovered in listing:
Ollama models do not auto-populate in OpenCode's model selector. You must specify them explicitly in config or via CLI flag.
Tool calls malformed:
Some quantized models (especially Q2/Q3) produce invalid JSON in function calls. Use Q4_K_M or higher for reliable tool use.
Context overflow silent failure:
If your model's context window is exceeded, Ollama silently truncates. OpenCode does not warn you. Watch for edits that ignore recently-shown code — it means context was lost.
Streaming interruption:
Some proxy setups (nginx, reverse proxy) timeout on long-running streaming responses. Set proxy timeouts to at least 120s for agent workloads.
The Practical Setup
Minimal working configuration for a developer workstation:

1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh

2. Pull a coding model: ollama pull qwen2.5-coder:7b

3. Set context: export OLLAMA_NUM_CTX=32768

4. Start Ollama: ollama serve

5. Configure OpenCode: set LOCAL_ENDPOINT=http://localhost:11434/v1

6. Run: opencode --model ollama/qwen2.5-coder:7b`

You now have a fully local coding agent. No API keys. No cloud dependency. Your code stays on your machine.

References

OpenCode. (2026). "Self-hosted model providers." https://opencode-ai-opencode.mintlify.app/advanced/self-hosted-models
OpenCode. (2026). "Configuration." https://dev.opencode.ai/docs/config/
OpenCode. (2026). "CLI Reference." https://dev.opencode.ai/docs/cli/
OpenCode Guide. (2026). "Running Local Models with OpenCode & Ollama." https://opencodeguide.com/en/opencode-with-ollama/
Tobrun. (2026). "Configure Local LLM with OpenCode." DEV Community. https://dev.to/tobrun/configure-local-llm-with-opencode-1gdb
Ollama. (2024). "API Documentation." https://github.com/ollama/ollama/blob/main/docs/api.md
OpenCode GitHub Issue #12243. "Local Ollama models not included in model listing." https://github.com/anomalyco/opencode/issues/12243

Cite as

devinfo.dev. (2026). "OpenCode with Local Models: Pointing Your Coding Agent at Your Own Inference." devinfo.dev:2026.0011. https://devinfo.dev/d/2026.0011

devinfo.dev | https://devinfo.dev/d/2026.0011
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev

What OpenCode Is

The Configuration Hierarchy

Method 1: Environment Variables (Quick Start)

Method 2: Project Config (.opencode.json)

Method 3: CLI Flag

What the Model Must Support

Context Window: The Hidden Constraint

The localllm-engine Advantage

Performance Expectations

Known Issues

The Practical Setup

References

Cite as

See also