The Router Is the System

inspiration | devinfo.dev | June 17, 2026 | devinfo.dev:2026.0036

Routing between models is not a configuration detail. It is a measurable, trainable system boundary — and treating it as one cuts inference costs by 40–85% without sacrificing quality.

Every multi-model inference stack has a routing decision. Most teams make it once, at architecture time, and never revisit it: "send everything to the big model."

That is a policy. It just isn't a deliberate one.

The Problem It Solves

A strong model and a weak model often answer easy questions equally well, but the strong model costs more per query. If you can predict which questions are easy, you can route them to the cheap model without the user noticing.

This is not a clever hack. It is a system design problem with a clean formulation: given a query, which model produces acceptable quality at minimum cost?
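The formulation above can be sketched in a few lines. This is a minimal illustration, not a production design: `ModelOption`, `cheapest_adequate`, the flat per-query cost, and the `predict_quality` callable are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_query: float                    # assumed flat cost, for illustration only
    predict_quality: Callable[[str], float]  # hypothetical estimator returning a value in [0, 1]


def cheapest_adequate(query: str, options: Sequence[ModelOption], min_quality: float) -> ModelOption:
    """Pick the cheapest model whose predicted quality clears the bar."""
    adequate = [m for m in options if m.predict_quality(query) >= min_quality]
    if adequate:
        return min(adequate, key=lambda m: m.cost_per_query)
    # No model is predicted to be adequate: fall back to the best predicted quality.
    return max(options, key=lambda m: m.predict_quality(query))
```

The whole problem hides inside `predict_quality`: the selection logic is trivial once a usable quality estimate exists.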

RouteLLM (Ong et al., UC Berkeley / LMSYS, 2024) operationalized this. They trained router models on human preference data (Chatbot Arena comparisons of strong vs. weak model responses) and used that signal to classify incoming queries. Result: on MT Bench, their best router reached 95% of GPT-4 quality while sending only about 26% of queries to GPT-4. Note that this is the share of calls routed to the strong model, not the share of total cost, since the weak model's calls are not free. The router itself adds less than 0.4% overhead to total inference cost.

The Router Is Learnable

The key insight from RouteLLM is that routing quality is measurable. They define a metric called Average Performance Gap Recovered (APGR): how much of the quality gap between the weak and strong model does the router recover, across the full range of cost thresholds?
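A rough sketch of how APGR could be computed offline. It assumes every evaluation query has a router score (higher means "needs the strong model") and a measured quality for each model. The function names and the evenly spaced grid of strong-model call fractions are illustrative choices made here, not the RouteLLM implementation, and the paper's exact discretization may differ.

```python
from typing import Sequence


def routed_quality(scores: Sequence[float], weak_q: Sequence[float],
                   strong_q: Sequence[float], strong_fraction: float) -> float:
    """Mean quality when the top `strong_fraction` of queries by score go to the strong model."""
    n = len(scores)
    k = round(strong_fraction * n)
    by_score = sorted(range(n), key=lambda i: scores[i], reverse=True)
    to_strong = set(by_score[:k])
    return sum(strong_q[i] if i in to_strong else weak_q[i] for i in range(n)) / n


def pgr(scores, weak_q, strong_q, strong_fraction: float) -> float:
    """Performance gap recovered at one cost level: 0 = weak-only quality, 1 = strong-only."""
    weak_mean = sum(weak_q) / len(weak_q)
    strong_mean = sum(strong_q) / len(strong_q)
    gap = strong_mean - weak_mean
    if gap == 0:
        raise ValueError("weak and strong models have identical mean quality")
    return (routed_quality(scores, weak_q, strong_q, strong_fraction) - weak_mean) / gap


def apgr(scores, weak_q, strong_q, steps: int = 10) -> float:
    """Average PGR over evenly spaced strong-model call fractions (1/steps, ..., 1)."""
    fractions = [i / steps for i in range(1, steps + 1)]
    return sum(pgr(scores, weak_q, strong_q, f) for f in fractions) / steps
```

A router whose scores rank the hard queries first recovers most of the gap at low strong-model fractions and so earns a higher APGR than one that ranks them last.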

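Two trivial routers mark the endpoints. Always routing to the strong model gets the strong model's quality and zero cost reduction. Always routing to the weak model gets maximum cost reduction and the weak model's quality. A random router only interpolates linearly between those endpoints: it buys quality at the same rate it spends money. A trained router is useful to the extent that it beats that straight line.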

The threshold parameter is the lever. You calibrate it against a sample of your actual traffic to hit whatever cost-quality tradeoff your application requires.
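One simple way to calibrate, sketched under the assumption that the router emits a single score per query and that higher scores mean "send to the strong model". `calibrate_threshold` and `route` are hypothetical helpers written for this example, not part of any library.

```python
from typing import Sequence


def calibrate_threshold(sample_scores: Sequence[float], strong_fraction: float) -> float:
    """Return a threshold that sends roughly `strong_fraction` of the sample to the strong model."""
    if not sample_scores:
        raise ValueError("need at least one sample score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(sample_scores, reverse=True)
    k = round(strong_fraction * len(ranked))
    if k == 0:
        return float("inf")  # nothing clears the bar, so everything goes to the weak model
    return ranked[k - 1]     # the k-th highest score becomes the cutoff


def route(score: float, threshold: float) -> str:
    """Apply the calibrated threshold to one incoming query's router score."""
    return "strong" if score >= threshold else "weak"
```

Tied scores at the cutoff all go to the strong model, so the realized fraction can run slightly above the target. The threshold is only as good as the sample, so it is worth recalibrating when the traffic mix shifts.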

Cascading vs. Routing

Routing makes one decision: which model handles this request. Cascading makes a sequence: try the weak model first; if confidence is low, escalate to the strong model.
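The difference is easiest to see side by side. In this sketch the models are plain callables, and `predict_quality` (which sees only the query) and `score_answer` (which sees the weak model's draft) are hypothetical estimators supplied by the caller.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]


def route(query: str, weak: Model, strong: Model,
          predict_quality: Callable[[str], float], min_quality: float) -> Tuple[str, str]:
    """Routing: one decision, made before any model runs."""
    if predict_quality(query) >= min_quality:
        return weak(query), "weak"
    return strong(query), "strong"


def cascade(query: str, weak: Model, strong: Model,
            score_answer: Callable[[str, str], float], min_quality: float) -> Tuple[str, str]:
    """Cascading: run the weak model first, keep its answer only if it scores well."""
    draft = weak(query)
    if score_answer(query, draft) >= min_quality:
        return draft, "weak"
    return strong(query), "strong"
```

Routing never pays for a model it does not use, but it has to judge from the query alone. Cascading judges an actual answer, at the price of a wasted weak-model call whenever it escalates.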

Dekoninck et al. (ICML 2025) showed these are not competing paradigms but points in the same design space. Their unified method, cascade routing, consistently outperformed pure routing and pure cascading in their experiments, because it combines the efficiency of routing on easy queries with the safety net of escalation on uncertain ones.

The critical factor in both approaches: quality estimation. How well can you predict, before or after the weak model runs, whether the output is acceptable? That estimator is the core of the system. Everything else is plumbing.

What This Means in Practice

1. The routing layer is a first-class component. It belongs in your architecture diagram, your monitoring stack, and your cost model — not buried in a config file.

2. Static rules are the right starting point. Route by query length, topic classification, or keyword match. This gets you 30–50% of the savings before you touch learned routing.

3. Learned routers transfer. RouteLLM routers trained on GPT-4 vs. a weak baseline maintained their performance when the strong and weak models were replaced by a different pair at test time. You do not need to retrain for every model change.

4. The 85% cost reduction figure is real, but conditional. It assumes a traffic mix where a large fraction of queries are genuinely simple. Measure your own traffic before setting expectations.
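As a starting point for item 2, a static router can be a handful of rules. The word limit and keyword list below are placeholder values chosen for illustration. They are assumptions, not recommendations, and should be tuned against real traffic.

```python
import re

# Hypothetical rule parameters; replace with values derived from your own traffic.
MAX_EASY_WORDS = 40
HARD_KEYWORDS = ("prove", "derive", "refactor", "debug")


def static_route(query: str) -> str:
    """Route by query length and keyword match; return "weak" or "strong"."""
    if len(query.split()) > MAX_EASY_WORDS:
        return "strong"  # long prompts are treated as hard
    lowered = query.lower()
    for keyword in HARD_KEYWORDS:
        # Whole-word match so "debug" does not fire on "debugger".
        if re.search(rf"\b{re.escape(keyword)}\b", lowered):
            return "strong"
    return "weak"
```

Logging each decision alongside an eventual quality signal turns these rules into training data for a learned router later.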

The Underlying Principle

A model is not a service. It is a cost function. Every query has a minimum-cost model that can answer it adequately. The routing system's job is to find that model before the inference runs.

If you are sending all queries to one model, you are not running a production inference system. You are running a prototype at production cost.

References

1. Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., & Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665. https://arxiv.org/abs/2406.18665

2. lm-sys/RouteLLM GitHub repository. LMSYS Org. https://github.com/lm-sys/RouteLLM

3. Dekoninck, J., Baader, M., & Vechev, M. (2025). A Unified Approach to Routing and Cascading for LLMs. Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), pp. 12987–13010. https://proceedings.mlr.press/v267/dekoninck25a.html

4. Klymentiev, D. (2026). LLM Router 2026: RouteLLM Benchmarks, Cut Costs 30–85%. https://klymentiev.com/blog/llm-router

5. Google Research. (2024). Speculative Cascades — A Hybrid Approach for Smarter, Faster LLM Inference. https://research.google/blog/speculative-cascades-a-hybrid-approach-for-smarter-faster-llm-inference/