AI & Strategy · 13 min read

AI Model Routing: Cut LLM Costs with Smart Orchestration in 2026

Most teams run every prompt through their most expensive model. AI model routing sends easy requests to cheap models and hard requests to premium ones, cutting LLM spend 40-70% without sacrificing quality.

Nate Laquis

Founder & CEO

Why One Model for Every Prompt Is Burning Your Budget

Here is the uncomfortable truth most AI teams discover around month four of production. You picked GPT-4o or Claude Sonnet 4.5 because it passed your eval suite, wired it into every endpoint, and now your invoice looks like a Series A fundraise. The problem is not the model. The problem is that you are paying premium-tier pricing for requests that a model fifteen times cheaper could handle perfectly.

In a typical production workload, roughly 60 to 80 percent of LLM requests are simple: classification, extraction, short rewrites, FAQ lookups, intent detection, summarization of short inputs. These do not need a frontier reasoning model. They need a fast, cheap workhorse. The other 20 to 40 percent, the ones involving multi-step reasoning, long context, code generation, or nuanced writing, absolutely do justify Sonnet 4.5 at 3 dollars per million input tokens and 15 dollars per million output tokens.

AI model routing is the discipline of deciding, per request, which model should handle it. Done well, it delivers 40 to 70 percent cost reduction with no measurable drop in output quality. Done badly, it creates latency spikes and quality regressions your users will notice before your dashboards do. This guide covers the tooling, patterns, and pitfalls we see across production deployments in 2026.


The Tiered Model Strategy: Cheap, Mid, and Frontier

Every effective routing setup starts with a tiering hierarchy. You define three or four performance and cost tiers, then decide which tier a request belongs to before you touch the expensive ones. Here is the 2026 landscape, grouped by real price points.

Tier 1 (Cheap and Fast). Claude Haiku 4 at 0.80 dollars per million input and 4 dollars per million output. Gemini 2.5 Flash at 0.30 dollars in and 2.50 dollars out. GPT-4o mini at 0.15 dollars in and 0.60 dollars out. DeepSeek V3 at 0.27 dollars in and 1.10 dollars out. These models handle classification, extraction, simple rewrites, short summaries, and routine tool calls at native speeds of 150 to 300 tokens per second.

Tier 2 (Balanced). Claude Sonnet 4.5, GPT-4o, Gemini 2.5 Pro. Sonnet 4.5 sits at 3 dollars in and 15 dollars out. These are your daily drivers for anything involving moderate reasoning, multi-turn conversation, or mid-complexity code edits.

Tier 3 (Frontier). Claude Opus 4, GPT-5, Gemini 3 Ultra. Opus 4 comes in at 15 dollars in and 75 dollars out, roughly twenty times more expensive than Haiku. You route here only for genuinely hard reasoning, long agentic chains, complex code refactors, and research-grade synthesis.

The math is brutal. If 70 percent of your traffic is Tier 1 eligible and you are running it through Opus 4, you are paying 20 to 100 times more than you need to for those requests, depending on which Tier 1 model could have handled them. We have seen clients spending 180 thousand dollars a month on Claude where a proper tiered routing layer would have put them closer to 55 thousand. For a deeper breakdown of provider economics, see our LLM API pricing comparison.
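To make the arithmetic concrete, here is a back-of-envelope sketch using the list prices above. The per-request token counts and the 70/25/5 traffic split are illustrative assumptions, not measurements from any one workload.

```python
# Back-of-envelope monthly cost comparison using the per-million-token list
# prices quoted above. Token counts per request are illustrative assumptions.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "haiku-4": (0.80, 4.00),
    "sonnet-4.5": (3.00, 15.00),
    "opus-4": (15.00, 75.00),
}

def monthly_cost(model: str, requests: int, in_tok: int, out_tok: int) -> float:
    """Cost of `requests` calls averaging in_tok input and out_tok output tokens."""
    p_in, p_out = PRICES[model]
    return requests * (in_tok * p_in + out_tok * p_out) / 1_000_000

REQUESTS = 12_000_000  # monthly volume, matching the case study later in this post

all_opus = monthly_cost("opus-4", REQUESTS, 800, 300)
tiered = (
    monthly_cost("haiku-4", int(REQUESTS * 0.70), 800, 300)    # Tier 1 eligible
    + monthly_cost("sonnet-4.5", int(REQUESTS * 0.25), 800, 300)
    + monthly_cost("opus-4", int(REQUESTS * 0.05), 800, 300)   # genuinely hard
)
print(f"all-Opus: ${all_opus:,.0f}/mo   tiered: ${tiered:,.0f}/mo")
```

Run with these assumptions, the all-Opus number comes out several times larger than the tiered one, which is the whole argument in two print statements.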

Classifier-Based Routing: Pick the Right Model Before You Run It

The core routing decision is this: given an incoming prompt, which tier should handle it? There are three mainstream approaches in production.

Rule-based routing. The simplest option. You define rules by endpoint, user tier, prompt length, or metadata. Your classification endpoint always goes to Gemini Flash. Your code generation endpoint always goes to Sonnet 4.5. Enterprise customers on your top tier always get Opus 4. This is crude but it works, and it is the right starting point because it requires zero extra infrastructure.
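In code, a rule-based router can be as small as a dictionary lookup. A minimal sketch; the endpoint names and model IDs are placeholders, not recommendations:

```python
# Minimal rule-based router: endpoint and plan metadata map statically to a model.

ROUTES = {
    "classify_intent": "gemini-2.5-flash",   # Tier 1: cheap and fast
    "generate_code": "claude-sonnet-4.5",    # Tier 2: moderate reasoning
}

def route_by_rules(endpoint: str, user_plan: str) -> str:
    if user_plan == "enterprise":
        return "claude-opus-4"  # top-tier customers always get the frontier model
    return ROUTES.get(endpoint, "claude-sonnet-4.5")  # safe default: Tier 2
```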

Classifier-based routing. You train or configure a lightweight classifier, usually a small model like GPT-4o mini or a fine-tuned BERT variant, that inspects the incoming prompt and predicts complexity. It outputs a tier label, and your router sends the request to the appropriate model. This captures the 20 to 40 percent of traffic that cannot be statically routed. Tools like Not Diamond and Martian ship with pretrained classifiers you can plug in directly.
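Here is a minimal sketch of the classifier pattern, using GPT-4o mini as the complexity classifier via the OpenAI SDK. The label set, prompt wording, and tier mapping are assumptions to calibrate against your own traffic:

```python
# Classifier-based routing sketch: a cheap model labels prompt complexity,
# and the label picks the model that actually serves the request.
from openai import OpenAI

client = OpenAI()
TIER_MODELS = {
    "simple": "gpt-4o-mini",
    "moderate": "claude-sonnet-4.5",
    "hard": "claude-opus-4",
}

def classify_tier(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # the classifier itself must be cheap and fast
        messages=[
            {"role": "system", "content": "Label the request as exactly one of: simple, moderate, hard."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=5,
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return TIER_MODELS.get(label, "claude-sonnet-4.5")  # unknown label: default to Tier 2
```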

Semantic similarity routing. You maintain an embedding index of historical prompts along with which model successfully handled them. New prompts get embedded, matched against the index, and routed to the same tier as their nearest neighbors. This is how Requesty and some internal systems at high-volume shops work. It adapts automatically as your prompt distribution drifts.
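A bare-bones version of the nearest-neighbor idea looks like this, with an in-memory index for illustration; a production build would use a vector store, and the 0.85 similarity floor is an assumption:

```python
# Semantic-similarity routing sketch: embed the prompt, find the most similar
# historical prompt, and reuse the tier that handled it successfully.
import numpy as np
from openai import OpenAI

client = OpenAI()
index_vectors: list[np.ndarray] = []  # embeddings of past prompts
index_tiers: list[str] = []           # model tier that successfully served each one

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def route_by_similarity(prompt: str, default: str = "claude-sonnet-4.5") -> str:
    if not index_vectors:
        return default  # cold start: no history yet
    q = embed(prompt)
    sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in index_vectors]
    best = int(np.argmax(sims))
    return index_tiers[best] if sims[best] >= 0.85 else default
```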

Our default recommendation for teams starting out is a hybrid. Use rules for 80 percent of traffic where the endpoint maps cleanly to a tier, then use a cheap classifier for the rest. You get most of the savings without introducing embedding infrastructure you have to maintain.
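Gluing the hybrid together is a few lines, reusing the route_by_rules and classify_tier sketches above. The set of statically routable endpoints is a placeholder:

```python
# Hybrid router: static rules cover the predictable traffic, and the cheap
# classifier only runs for the remainder.
STATIC_ENDPOINTS = {"classify_intent", "faq_lookup", "summarize_ticket"}

def route(endpoint: str, user_plan: str, prompt: str) -> str:
    if endpoint in STATIC_ENDPOINTS or user_plan == "enterprise":
        return route_by_rules(endpoint, user_plan)  # the ~80% that maps cleanly
    return classify_tier(prompt)                    # classifier handles the rest
```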


The Routing Tooling Landscape in 2026

You do not need to build this from scratch. There are now five mature routing platforms, each with a distinct philosophy.

Portkey. The most production-hardened option. Portkey acts as an AI gateway, giving you unified API access to 200 plus models, configurable routing rules, fallback chains, semantic caching, guardrails, and observability in one place. You change your base URL to Portkey, add a config header specifying routing rules in JSON, and you are done. Pricing starts around 99 dollars per month for teams and scales with request volume. We use Portkey on most client deployments because it gets you 90 percent of the wins in a week.
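For illustration, the integration looks roughly like this. The header names match Portkey's published docs, but the config schema below is paraphrased from their fallback examples, so verify it against the current documentation before shipping:

```python
# Pointing the OpenAI SDK at the Portkey gateway with a routing config header.
import json
from openai import OpenAI

config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"virtual_key": "anthropic-prod"},  # primary: Claude via Portkey's key vault
        {"virtual_key": "openai-prod"},     # fallback provider
    ],
}

client = OpenAI(
    base_url="https://api.portkey.ai/v1",
    api_key="unused",  # provider keys live in Portkey's vault, not in your code
    default_headers={
        "x-portkey-api-key": "YOUR_PORTKEY_API_KEY",
        "x-portkey-config": json.dumps(config),
    },
)
```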

OpenRouter. The marketplace approach. OpenRouter aggregates dozens of providers behind a single OpenAI-compatible API and lets you specify model preferences, including automatic fallback. It is cheaper than running individual provider accounts because OpenRouter often passes along volume discounts. Great for exploration and cost comparison, lighter on enterprise governance features.

Martian. Routing as a research product. Martian publishes real benchmarks showing their router achieving GPT-4 level quality at a fraction of the cost by dynamically selecting models. It is best when you want a hands-off router that just decides for you based on predicted quality and price.

Requesty. Strong on semantic caching and prompt-level analytics. Their router uses embedding similarity to pick models, and their cache layer deduplicates near-identical prompts, which alone can cut costs 20 to 30 percent for workloads with repetitive queries.

Not Diamond. A pure routing specialist. You send a prompt to Not Diamond, it returns the recommended model to use, and you call that model yourself. This gives you more control than a gateway but requires more plumbing. Their router is trained on large quality benchmarks and updated as new models ship.

Semantic Caching: The Free 20 Percent

Before you even touch routing logic, turn on semantic caching. In most production workloads, 15 to 30 percent of prompts are duplicates or near duplicates of recent prompts. A caching layer stores embeddings of past prompts and serves the stored response whenever a new prompt clears a similarity threshold (typically 0.95 cosine similarity), returning instant answers at zero model cost.

Portkey, Requesty, and Helicone all ship semantic cache out of the box. You can also build it yourself with Redis plus a small embedding model like OpenAI text-embedding-3-small at 0.02 dollars per million tokens. The trade-off is cache staleness. For customer support bots where answers rarely change, caching for hours or days is safe. For real-time data retrieval, you set shorter TTLs or skip caching on endpoints that require freshness.
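A bare-bones version of the DIY approach looks like this; it keeps the index in memory for illustration, where a production build would back it with Redis and handle eviction:

```python
# Minimal semantic cache sketch: store (embedding, response, expiry) tuples and
# serve a stored response when cosine similarity clears the threshold.
import time
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str, float]] = []  # (embedding, response, expires_at)

def _embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cache_lookup(prompt: str, threshold: float = 0.95) -> str | None:
    q, now = _embed(prompt), time.time()
    for vec, response, expires in _cache:
        sim = float(q @ vec / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if expires > now and sim >= threshold:
            return response  # cache hit: zero model cost
    return None  # miss: route to the cheapest capable tier as usual

def cache_store(prompt: str, response: str, ttl_seconds: int = 86_400) -> None:
    _cache.append((_embed(prompt), response, time.time() + ttl_seconds))
```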

Combined with routing, semantic caching stacks. A cached hit costs nothing. A cache miss gets routed to the cheapest capable tier. A failed tier cascades to the next. Your effective blended cost per request drops in stages.

Prompt caching is the other half of the story. Both Anthropic and OpenAI now offer native prompt caching where you mark static portions of your prompt, typically system instructions and retrieved context, as cacheable. Anthropic charges 1.25x the base input price to write to cache and 0.10x to read from it. For a 10,000 token system prompt that you reuse across a user session, you pay full price once, then one tenth of the price for every subsequent call. On long-running agents this alone can cut input costs by 80 percent. See our guide on managing LLM API costs for a deeper walkthrough of prompt caching mechanics.
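Mechanically, Anthropic's version is a cache_control marker on the static content block. The marker follows Anthropic's documented API; the model ID and prompt below are placeholders:

```python
# Anthropic prompt caching sketch: mark the large static system prompt as
# cacheable so subsequent calls in the session read it at 0.10x input price.
import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # the ~10,000 token instructions reused all session

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache write at 1.25x, reads at 0.10x
        }
    ],
    messages=[{"role": "user", "content": "Summarize this customer's open tickets."}],
)
```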

Fallback Chains and Quality Guardrails

Routing without fallbacks is fragile. Providers go down. Rate limits trigger. Models occasionally return garbage. A production routing layer needs three resilience mechanisms.

Provider fallback. If your primary call to Claude Sonnet 4.5 fails with a 529 or times out, the router automatically retries against GPT-4o or Gemini 2.5 Pro. Portkey and OpenRouter let you configure this as an ordered list. The fallback should be a model of comparable quality, not a downgrade to Tier 1, unless you explicitly accept the trade-off.
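In code, a fallback chain is an ordered loop with error handling. call_model and the exception type below are stand-ins for whatever provider SDK you actually use:

```python
# Fallback chain sketch: try comparable-quality providers in order, moving on
# after timeouts or overload errors rather than failing the request.

class ProviderOverloadedError(Exception):
    """Stand-in for your SDK's 529/overloaded error type."""

def call_model(model: str, prompt: str, timeout: float = 10.0) -> str:
    ...  # stand-in for your actual provider SDK call

FALLBACK_CHAIN = ["claude-sonnet-4.5", "gpt-4o", "gemini-2.5-pro"]

def complete_with_fallback(prompt: str) -> str:
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except (TimeoutError, ProviderOverloadedError) as e:
            last_error = e  # log it and try the next provider in the chain
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```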

Quality-gated upgrade. You run the cheap model first, score the output with a small judge model or a heuristic, and if it fails to meet a quality threshold, re-run the prompt on a higher tier. This is sometimes called cascading routing or a model cascade. Research from 2024 and 2025 showed cascades can match frontier quality at 30 to 50 percent of the cost on many benchmarks. The gotcha is that the judge must be well calibrated. A bad judge routes everything up and defeats the purpose.
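A cascade sketch, reusing the hypothetical call_model stand-in from the fallback example. The 0-to-10 rubric, the simplified parsing, and the 0.7 threshold are all assumptions to calibrate against your eval set:

```python
# Model cascade sketch: cheap model first, small judge scores the output, and
# the prompt only escalates to a pricier tier when the quality gate fails.

def judge_score(prompt: str, answer: str) -> float:
    verdict = call_model(
        "gpt-4o-mini",
        f"Rate from 0 to 10 how well this answer addresses the request.\n"
        f"Request: {prompt}\nAnswer: {answer}\nReply with only the number.",
    )
    return float(verdict.strip()) / 10.0  # parsing simplified for illustration

def cascade(prompt: str, threshold: float = 0.7) -> str:
    answer = call_model("claude-haiku-4", prompt)   # cheap tier first
    if judge_score(prompt, answer) >= threshold:
        return answer                                # good enough: keep the savings
    return call_model("claude-sonnet-4.5", prompt)   # escalate on a failed gate
```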

Cost caps and circuit breakers. Set hard spending limits per user, per endpoint, and per day. If a single user suddenly consumes 500 dollars of Opus 4 in an hour, something has gone wrong: either a prompt injection, a runaway agent loop, or abuse. Your router should throttle or reject before that becomes a 50,000 dollar incident. Portkey and Helicone both ship budget alerts.
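A minimal in-process circuit breaker looks like the sketch below. The hourly cap is illustrative, and a real deployment would persist counters in Redis or lean on the gateway's built-in budgets:

```python
# Per-user spend circuit breaker sketch: track a rolling hour of estimated
# cost and reject requests once a hard cap is hit.
import time
from collections import defaultdict

HOURLY_CAP_USD = 50.0  # illustrative hard limit per user
_spend: dict[str, list[tuple[float, float]]] = defaultdict(list)  # user -> (ts, usd)

def check_budget(user_id: str, estimated_cost_usd: float) -> None:
    cutoff = time.time() - 3600
    _spend[user_id] = [(ts, usd) for ts, usd in _spend[user_id] if ts > cutoff]
    if sum(usd for _, usd in _spend[user_id]) + estimated_cost_usd > HOURLY_CAP_USD:
        raise RuntimeError(f"budget cap tripped for {user_id}; request rejected")
    _spend[user_id].append((time.time(), estimated_cost_usd))
```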


Benchmarking Quality: You Cannot Route What You Cannot Measure

The single biggest failure mode in model routing is skipping the eval step. Teams enable routing, cost drops 50 percent, everyone celebrates, then two weeks later the support queue fills with complaints about weird answers. You solved the cost problem and created a quality problem.

Before you route a single production request to a cheaper model, build a task-specific eval set: fifty to two hundred representative prompts with expected outputs or quality rubrics. Run every candidate model against that set, score the outputs with a stronger judge model like Opus 4 or GPT-5, and compare. If Haiku 4 scores 92 percent on your classification eval and Sonnet 4.5 scores 94 percent, route it to Haiku. If Haiku scores 71 percent, do not.
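The harness itself is small. This sketch reuses the call_model and judge_score stand-ins from the previous section (in practice you would point the judge at a stronger model for evals); the 0.03 margin, roughly the 2-point gap described above on a 0-to-1 scale, is an assumption to set per task:

```python
# Eval gate sketch: score each candidate over the eval set with a judge model
# and approve a cheaper route only if it stays within a margin of the incumbent.
EVAL_SET: list[str] = []  # load your 50-200 representative prompts here

def eval_model(model: str) -> float:
    scores = [judge_score(p, call_model(model, p)) for p in EVAL_SET]
    return sum(scores) / len(scores)

def approve_downgrade(candidate: str, incumbent: str, margin: float = 0.03) -> bool:
    return eval_model(candidate) >= eval_model(incumbent) - margin
```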

The eval set is also how you handle model drift. When Anthropic ships Haiku 5 or Google ships Flash 3, you re-run the suite, check that quality holds or improves, and update your routing table. Without an eval set you are guessing. We maintain versioned eval suites for every client we run routing layers for, and we re-run them monthly or on any model update. It takes a day to set up and saves entire product launches.

Some teams ask whether self-hosting a fine-tuned model is cheaper than routing at all. For high-volume, narrow-task workloads, sometimes yes. Our analysis in self-hosted LLMs versus APIs walks through the break-even math. For most teams, routing against commercial APIs remains the better ROI because you avoid GPU capacity management and benefit from continuous quality improvements.

A Real Implementation: From 180K to 55K in Six Weeks

Here is a concrete example from a recent engagement. A Series B SaaS client running an AI-powered customer support product was burning 180 thousand dollars a month on Claude Sonnet 4.5 across roughly 12 million requests. Their endpoints broke down as follows: intent classification, 45 percent of traffic; FAQ answer lookup, 22 percent; ticket summarization, 18 percent; multi-turn conversational resolution, 15 percent.

Week one, we added Portkey as the gateway and turned on semantic caching with a 24-hour TTL on the FAQ endpoint. Cache hit rate stabilized at 34 percent, cutting that endpoint's spend by roughly a third. Week two, we built eval sets for classification and summarization, confirmed Haiku 4 scored within 2 points of Sonnet 4.5 on both, and routed them. That shifted 63 percent of traffic to a model roughly four times cheaper on both input and output.

Week three, we enabled Anthropic prompt caching on the conversational endpoint, where the system prompt and retrieved knowledge base context stayed constant across a session. Input costs on that endpoint dropped 78 percent. Week four, we added fallback chains from Sonnet 4.5 to GPT-4o with a 10-second timeout. Weeks five and six were monitoring, tuning thresholds, and building the internal dashboard their finance team wanted.

The final bill for the following month was 54,800 dollars. Quality scores on their internal CSAT proxy held flat within statistical noise. Total engineering time was around 80 hours across two engineers. If your LLM spend is north of 20 thousand dollars a month and you are still running everything through a single frontier model, you are leaving the same kind of money on the table.

If you want help designing a routing and caching layer for your own stack, book a free strategy call and we will walk through your current spend, traffic distribution, and the fastest path to cutting it in half.


Tags: AI model routing, LLM cost optimization, Portkey, OpenRouter, semantic caching
