Technology·14 min read

Prompt Caching and Smart Routing: Cutting LLM API Costs by 80%

Most teams overpay for LLM inference by 3x to 5x because they send every request to the same expensive model with zero caching. Prompt caching and intelligent routing are the two highest-leverage cost optimizations you can make, and most teams ignore both.

Nate Laquis

Nate Laquis

Founder & CEO

Why Your LLM Bill Is 3x Higher Than It Needs to Be

If you are running an AI-powered product in production, your LLM API bill is almost certainly higher than it should be. We have audited the AI infrastructure of over 40 startups and mid-size companies in the past two years, and the pattern is remarkably consistent: teams pick a frontier model, hard-code it into every API call, and send identical system prompts and context windows with every single request. No caching. No routing. No cost awareness at the inference layer.

The result is predictable. A B2B SaaS product handling 100,000 requests per day on Claude Opus with a 4,000-token system prompt is paying roughly $18,000 per month just in input token costs for that system prompt alone. That system prompt is identical across every request. It never changes. Yet the team pays full price for it every single time. That is the most obvious waste, and it is just the beginning.

Analytics dashboard showing LLM API cost metrics and usage patterns

The second source of waste is sending every query to the same model regardless of complexity. A user asking "What are your business hours?" gets routed to the same $15-per-million-token model as a user asking for a detailed competitive analysis. That simple query could be handled by a model costing $0.25 per million tokens with no perceptible quality difference. The opportunity cost of ignoring these two optimizations, prompt caching and smart routing, typically ranges from 60% to 85% of your total LLM spend.

This article covers both strategies in depth: how prompt caching works across all major providers, how to design cache-friendly prompts, how to build a model routing layer, and the specific tools and implementation patterns that let you cut costs without sacrificing quality. These are not theoretical savings. We have implemented these patterns for production systems and watched monthly bills drop from $45,000 to under $9,000.

How Prompt Caching Works: Anthropic, OpenAI, and Google

Prompt caching lets you avoid paying full price for the static portions of your prompt on repeated requests. The core idea is simple: if the first N tokens of your prompt are identical across multiple API calls, the provider can cache that prefix and charge you a fraction of the normal input token cost on subsequent requests. The implementations differ across providers, but the economics are compelling everywhere.

Anthropic's Prompt Caching

Anthropic's implementation is the most explicit and, in our experience, the most cost-effective. You mark specific blocks of your prompt with a cache_control parameter set to "ephemeral." When the API receives a request, it checks whether the cached prefix matches a previous request. If it does, you pay 90% less for those cached tokens. The first request that creates the cache incurs a 25% write premium, but that cost is amortized almost immediately if you are making more than a handful of requests. The cache has a 5-minute TTL that resets with each cache hit, so active workloads keep the cache warm indefinitely.

For a real example: a customer support bot with a 3,500-token system prompt handling 50,000 requests per day on Claude Sonnet saves roughly $4,200 per month just from caching the system prompt. That is a single line of code added to the API call. The cache supports up to four breakpoints, so you can cache your system prompt, a set of few-shot examples, a knowledge base chunk, and a conversation history prefix independently.

OpenAI's Automatic Caching

OpenAI takes a different approach. Caching is automatic and requires no code changes. Any prompt prefix of 1,024 tokens or more that repeats across requests is eligible for a 50% input cost discount. There is no write premium. The tradeoff is that you get less control: you cannot force specific sections to be cached, the minimum prefix length is higher (1,024 tokens vs. Anthropic's 1,024 for Claude but with explicit control), and the discount is smaller (50% vs. Anthropic's 90%). For teams already on OpenAI who do not want to refactor their prompt structure, this is a painless win. For teams optimizing aggressively, the 50% discount leaves significant savings on the table compared to Anthropic's 90%.

Google's Context Caching

Google's Gemini offers context caching as a separate API endpoint. You create a cached content object with your static context, specify a TTL, and then reference that cache in subsequent requests. Cached tokens are billed at 75% less than standard input tokens. The minimum cacheable content is 32,768 tokens, which makes this most useful for large-context workloads like document analysis or RAG pipelines with substantial retrieval contexts. The per-hour storage cost for the cache is minimal, roughly $0.50 per million tokens per hour, but it means you are paying for cache storage even when not actively querying.

The bottom line: if you are on Anthropic, prompt caching is the single highest-ROI optimization you can make. The 90% discount with explicit cache control is unmatched. If you are on OpenAI, you get automatic savings with zero effort but less control. If you are on Google, context caching excels for large-context workloads but has a higher minimum threshold. For a deeper comparison of provider pricing, see our LLM API pricing breakdown.

Designing Cache-Friendly Prompts

Getting the most out of prompt caching is not just about flipping a switch. The structure of your prompt determines how much of it can be cached and how often cache hits occur. Most prompts are structured in the worst possible way for caching, with dynamic content mixed into the beginning and static content buried at the end.

The Golden Rule: Static Content First, Dynamic Content Last

Prompt caching works on prefixes. The provider caches everything from the start of your prompt up to the cache breakpoint. Any content after the breakpoint is treated as new, uncached input. This means your prompt structure should follow a strict order: system instructions first, then few-shot examples, then retrieved context or knowledge base chunks, then conversation history, and finally the user's current query. Every piece of dynamic or request-specific content should be pushed as far toward the end as possible.

A common anti-pattern is embedding the user's name or account-specific details into the system prompt. Something like "You are an assistant for Acme Corp, helping John Smith with his enterprise account" at the top of every prompt. That one line of personalization invalidates the entire cache because the prefix changes with every user. Instead, move all personalization into a user message block after the cached system prompt. Your system prompt should contain only instructions that are identical across all users and all requests.

Layered Caching with Multiple Breakpoints

Anthropic supports up to four cache breakpoints, which lets you build a layered caching strategy. The first layer is your base system prompt, which is identical across all requests and all users. The second layer is your few-shot examples, which might vary by feature or task type but remain stable within a feature. The third layer is retrieved context from your knowledge base, which changes based on the query but often repeats for similar questions. The fourth layer is conversation history, which grows with each turn but shares a common prefix.

This layered approach means that even when the full prompt is not cached, portions of it are. A request from a new user on a topic you have seen before might get a cache hit on layers one, two, and three, paying full price only for the new conversation turn. In practice, we see 70% to 85% of total input tokens served from cache in well-structured production systems.

Few-Shot Example Ordering

If you use few-shot examples in your prompts, keep them in a consistent order and add new examples at the end. Adding or reordering examples at the beginning of the few-shot block invalidates the cache for every example that follows. We recommend maintaining a canonical, versioned list of few-shot examples and appending task-specific examples after the canonical set. This way, the canonical examples stay cached even when you add specialized ones for different query types.

Developer writing cache-optimized prompt code for LLM API integration

One more detail that trips up teams: whitespace and formatting matter. Prompt caches match on exact token sequences. A trailing space, a different newline character, or an extra line break will cause a cache miss. Normalize your prompt templates rigorously. Use a prompt management layer that strips trailing whitespace and enforces consistent formatting before sending requests to the API.

Semantic Caching: Going Beyond Exact Matches

Provider-level prompt caching only works for exact prefix matches. If two users ask the same question with slightly different wording, both pay full price. Semantic caching solves this by adding an application-level cache that matches queries by meaning rather than by exact text.

How Semantic Caching Works

The concept is straightforward. When a request comes in, you generate an embedding of the user's query using a cheap embedding model (OpenAI's text-embedding-3-small costs $0.02 per million tokens). You compare that embedding against a vector store of previously seen queries. If the cosine similarity exceeds a threshold (typically 0.95 to 0.98), you return the cached response instead of calling the LLM. If it does not match, you call the LLM, store the response, and index the query embedding for future matches.

The savings can be dramatic. In customer support and FAQ-style applications, 30% to 60% of incoming queries are semantically identical to previous queries. A SaaS product fielding 80,000 support queries per month with a 40% cache hit rate on Claude Sonnet saves roughly $3,800 per month in LLM costs, minus about $20 in embedding costs. The ROI is absurd.

When Semantic Caching Works Well

Semantic caching is most effective for workloads with high query repetition and stable answers: customer support bots, FAQ assistants, product recommendation engines, and internal knowledge bases. In these contexts, the same questions recur constantly with minor wording variations. "How do I reset my password," "I need to change my password," and "password reset help" should all return the same cached response.

It also works well for search-augmented generation (RAG) pipelines where the retrieval step is deterministic. If two queries retrieve the same set of documents, the LLM's response will be nearly identical. Caching at the full-prompt level (query plus retrieved context) avoids redundant LLM calls.

When Semantic Caching Hurts

Semantic caching is dangerous when answers depend on real-time data, user-specific state, or evolving context. If a user asks "What is my account balance?" and you return a cached response from a different user's identical query, you have a serious problem. Similarly, if your product answers questions about live inventory, current pricing, or recent events, cached responses can quickly become stale and misleading.

The mitigation is to apply semantic caching selectively. Tag queries with metadata (user ID, data freshness requirements, personalization level) and only cache queries that are safe to cache. Build explicit cache invalidation triggers tied to data updates. A product catalog change should flush all cached responses that reference product data. This requires more engineering effort than provider-level caching, but the cost savings justify the investment for high-volume workloads. For a broader view of cost management strategies, see our guide on managing LLM API costs.

Smart Routing: Sending Every Query to the Right Model

Prompt caching reduces the cost of each individual API call. Smart routing reduces costs by sending each request to the cheapest model that can handle it well. Together, they compound. A query that hits the prompt cache and routes to a budget model can cost 95% less than the same query sent uncached to a frontier model.

Complexity-Based Routing

The simplest and most effective routing strategy classifies incoming queries by complexity and routes accordingly. Simple, well-defined tasks like classification, entity extraction, formatting, translation, and yes/no questions go to budget models (Claude Haiku, GPT-4o-mini, Gemini Flash) at $0.25 to $0.50 per million input tokens. Medium-complexity tasks like summarization, Q&A with context, conversational responses, and standard content generation go to mid-tier models (Claude Sonnet, GPT-4o, Gemini Pro) at $3 to $5 per million input tokens. Complex tasks that require deep reasoning, multi-step analysis, creative generation, or handling ambiguous queries go to frontier models (Claude Opus, GPT-5, Gemini Ultra) at $10 to $15 per million input tokens.

The routing decision itself can be made by a lightweight classifier. We typically use a fine-tuned Haiku or GPT-4o-mini model that classifies query complexity into three tiers based on features like query length, question type, domain specificity, and the presence of multi-step reasoning indicators. The classification call costs fractions of a cent and typically adds under 200ms of latency. The cost savings from correct routing dwarf the cost of the classification step by orders of magnitude.

Cost-Based Routing

Cost-based routing is a variant where you set a per-request budget and route to the cheapest model that meets your quality threshold. This works well for products with tiered pricing. Free-tier users get responses from Haiku. Pro users get Sonnet. Enterprise users get Opus. The routing logic is trivial, just a lookup table based on the user's plan, but the cost impact is substantial. A product with 70% free users, 25% pro users, and 5% enterprise users can reduce its blended cost per request by 60% compared to running everyone on the same model.

Latency-Based Routing

Some queries need fast responses (autocomplete, real-time suggestions, inline edits) while others can tolerate longer generation times (document analysis, report generation, batch processing). Latency-based routing sends time-sensitive requests to smaller, faster models and routes complex but non-urgent requests to larger models. Flash and Haiku models typically respond 3x to 5x faster than their frontier counterparts, so this strategy improves user experience and cuts costs simultaneously.

Dashboard showing real-time model routing metrics and cost analytics

Hybrid Routing

In practice, the best routing strategies combine all three dimensions. A request is evaluated for complexity, checked against the user's cost tier, and assessed for latency requirements. The router picks the model that satisfies all three constraints at the lowest cost. This sounds complicated, but the implementation is a simple scoring function. For each available model, compute a score based on: does it meet the quality bar for this complexity level? Does it fit within the cost budget for this user tier? Does it meet the latency requirement for this request type? Pick the cheapest model that passes all three checks. We have seen this approach reduce blended LLM costs by 70% to 80% with no measurable quality degradation on production workloads. For more detail on routing architectures, see our model routing deep dive.

Tools and Implementation Patterns

You do not need to build everything from scratch. Several mature tools handle prompt caching, model routing, and provider abstraction. Here is what we recommend and what we have seen work in production.

LiteLLM

LiteLLM is the most popular open-source proxy for unifying LLM provider APIs. It provides a single OpenAI-compatible interface that routes to over 100 models across Anthropic, OpenAI, Google, Cohere, and self-hosted models. It handles prompt caching parameters automatically for providers that support them, tracks cost per request, and supports fallback routing (if the primary model fails, try the secondary). For teams that want a self-hosted routing layer with full control, LiteLLM is the starting point. The setup takes about 30 minutes, and you can run it as a Docker container alongside your application.

Portkey

Portkey is a managed AI gateway that provides routing, caching, observability, and guardrails in a single platform. Its semantic caching feature is built in, so you do not need to manage a separate vector store. The routing engine supports conditional logic based on request metadata, user attributes, and model performance metrics. Portkey is our recommendation for teams that want these capabilities without the operational burden of running their own proxy infrastructure. The free tier handles up to 10,000 requests per month, which is enough to validate the approach before committing.

Vercel AI SDK

If you are building with Next.js or any JavaScript/TypeScript stack, the Vercel AI SDK provides the cleanest abstraction for multi-model applications. It does not handle caching or routing natively, but it gives you a unified interface for calling any provider. Combined with a simple routing function, you can build a cost-optimized inference layer in under 100 lines of code. The streaming support is excellent, and the provider-switching experience is seamless.

Custom Routing: When to Build Your Own

Build a custom router when your routing logic depends on business-specific signals that off-the-shelf tools cannot handle. For example, if your routing decision depends on the user's historical accuracy preferences, the sensitivity of the data being processed, or real-time model performance metrics from your own evaluation pipeline, you will outgrow generic routing tools quickly. The architecture is simple: a lightweight classification service that evaluates each request and returns a model selection, sitting between your application and a provider abstraction layer like LiteLLM or the Vercel AI SDK.

Implementation Order

If you are starting from zero, here is the order we recommend. First, add provider-level prompt caching. This requires minimal code changes (often just adding a cache_control parameter) and delivers immediate savings. Second, implement basic rule-based routing. Route known simple tasks to budget models. This takes a day of engineering time and typically cuts costs by 30% to 40%. Third, add semantic caching for high-repetition query patterns. This requires an embedding model and a vector store but pays for itself within the first week on high-volume workloads. Fourth, build a complexity classifier for dynamic routing. This is the most sophisticated step but yields the largest incremental savings on diverse workloads. Each step compounds on the previous ones, and you can stop at any point once the savings meet your targets.

Real Cost Savings: Before and After

Let us walk through three real scenarios with specific numbers. These are based on production systems we have optimized. The exact product details are anonymized, but the cost figures are accurate.

Scenario 1: B2B Customer Support Bot

A SaaS company running a customer support chatbot on Claude Opus. The bot handles 120,000 conversations per month with an average of 4 turns each (480,000 API calls). The system prompt is 3,800 tokens. The average user message plus retrieved context is 2,200 tokens. Monthly LLM cost before optimization: $52,000.

Optimizations applied: prompt caching on the system prompt (90% input cost reduction on 3,800 tokens per call), semantic caching with a 0.96 similarity threshold (38% of queries served from cache), and complexity-based routing (65% of queries routed to Sonnet, 20% to Haiku, 15% remaining on Opus). Monthly LLM cost after optimization: $8,400. That is an 84% reduction. The engineering effort was approximately 3 weeks, including testing and staged rollout.

Scenario 2: Document Analysis Platform

A legal tech startup processing contracts through an LLM pipeline: extraction, summarization, risk flagging, and clause comparison. Processing 8,000 documents per month with an average of 45,000 tokens per document. Running entirely on GPT-5. Monthly LLM cost before optimization: $38,000.

Optimizations applied: Google Gemini context caching for the base analysis prompt and few-shot examples (the 32K minimum threshold was easily met), routing extraction and classification subtasks to GPT-4o-mini, keeping summarization on GPT-4o, and reserving GPT-5 only for risk flagging and complex clause analysis. Semantic caching was not applicable here because every document is unique. Monthly LLM cost after optimization: $11,500. That is a 70% reduction. The team also saw a 40% improvement in end-to-end processing speed because the smaller models responded faster on the simpler subtasks.

Scenario 3: Consumer AI Assistant

A consumer app with a general-purpose AI assistant handling 2 million requests per month. Running on Claude Sonnet with a 2,000-token system prompt. Monthly LLM cost before optimization: $28,000.

Optimizations applied: Anthropic prompt caching on the system prompt (90% reduction on the static prefix), routing 55% of simple queries (greetings, factual lookups, formatting requests) to Haiku, and semantic caching with a 0.97 threshold catching 25% of queries. Monthly LLM cost after optimization: $5,200. That is an 81% reduction. The app's average response time actually improved because Haiku responds faster than Sonnet for the simple queries that were routed down.

The pattern across all three scenarios is consistent. Prompt caching alone delivers 20% to 35% savings. Adding routing delivers another 30% to 45%. Adding semantic caching delivers an additional 10% to 25% on applicable workloads. The strategies compound multiplicatively, not additively, because you are caching the prompts on cheaper models.

When Caching and Routing Hurt: Pitfalls to Avoid

These optimizations are not universally beneficial. There are real scenarios where caching introduces bugs, routing degrades quality, and the added complexity is not worth the savings. Knowing when not to optimize is as important as knowing how.

Cache Staleness

Semantic caching is a liability when your data changes frequently. If your product answers questions about live inventory, real-time pricing, or breaking news, a cached response from 10 minutes ago could be wrong. The fix is straightforward: do not cache time-sensitive queries, or set aggressive TTLs (under 60 seconds) and invalidate caches when underlying data changes. But if your team does not build robust cache invalidation from the start, stale responses will slip through and erode user trust.

Routing Quality Degradation

The biggest risk with model routing is misclassifying a complex query as simple and sending it to a model that cannot handle it well. A user asking a nuanced question about tax implications of stock options should not get a Haiku response. The failure mode is subtle: the cheaper model returns a confident, fluent, and wrong answer. The user does not know the response is low quality because it reads well.

Mitigate this by erring on the side of routing up rather than down. Set your complexity classifier to default to the mid-tier model and only route to the budget tier for queries it classifies with high confidence (above 0.9) as simple. Run A/B tests comparing routed responses to baseline (all traffic on the premium model) and track quality metrics like user satisfaction scores, follow-up question rates, and explicit feedback signals. If quality degrades on any segment, tighten your routing thresholds.

Over-Engineering for Low Volume

If your LLM spend is under $500 per month, building a semantic caching layer with a vector store and a routing classifier is over-engineering. The operational complexity and maintenance burden will cost more in engineering time than you save in API costs. Start with provider-level prompt caching (near zero effort) and basic rule-based routing (a few hours of work). Graduate to more sophisticated approaches only when your monthly spend justifies the investment, typically above $3,000 to $5,000 per month.

Caching Personalized Responses

If your LLM generates responses that are personalized to each user (referencing their name, account data, preferences, or history), semantic caching can serve one user's personalized response to a different user with a similar query. This is not just a bad user experience. It is a potential data leak. Either exclude personalized queries from semantic caching entirely, or include user-specific metadata in your cache key so that cached responses are only served to the same user who generated them.

The Complexity Tax

Every layer you add to your inference pipeline is a layer that can fail. A semantic cache introduces a vector store dependency. A routing classifier introduces a classification model dependency. A caching proxy introduces a network hop. Each component needs monitoring, alerting, and graceful degradation when it goes down. Design your system so that cache misses and router failures fall through to a sensible default (typically calling the mid-tier model directly). Never let an optimization layer become a single point of failure for your entire AI feature set.

The teams that get the best results treat cost optimization as an iterative process, not a one-time project. Start with the easiest wins (prompt caching, basic routing), measure the impact, and add complexity only when the data supports it. If you want help auditing your current LLM spend and identifying the highest-impact optimizations for your specific workload, book a free strategy call and we will walk through your architecture together.

Need help building this?

Our team has launched 50+ products for startups and ambitious brands. Let's talk about your project.

prompt caching LLM API cost reductionLLM smart routingAI inference cost optimizationsemantic caching embeddingsmodel routing strategies

Ready to build your product?

Book a free 15-minute strategy call. No pitch, just clarity on your next steps.

Get Started