Context Engineering Is Not Just RAG with Extra Steps
If you have built an AI product, you have probably hit the same wall everyone else hits. The model is plenty smart. The problem is that it has no idea what is going on. It hallucinates because you gave it a 200-token system prompt and expected it to answer questions about your entire knowledge base. Or you stuffed 50,000 tokens of raw documents into the prompt and the model got confused, slow, and expensive.
That gap between "model can reason" and "model has the right information to reason about" is exactly what context engineering solves. And no, it is not just retrieval-augmented generation with a fancier name. RAG is one component of a context pipeline. A good pipeline also handles ranking, compression, multi-source assembly, caching, window management, and quality evaluation. Each of those stages has its own failure modes, and skipping any of them is how you end up with a product that works in demos but collapses under real user traffic.
At Kanopy Labs, we have built context pipelines for products ranging from legal research tools to customer support agents to financial analysis platforms. The patterns are remarkably consistent across domains. This guide walks through the full architecture, stage by stage, with specific tools, costs, and timelines so you can actually build one instead of just reading about it.
The Context Pipeline Architecture: Five Stages That Matter
A production context pipeline has five distinct stages, and understanding the boundaries between them is critical. Every stage transforms the data in some way before passing it downstream. If you blur the stages together, debugging becomes a nightmare because you cannot tell whether the model gave a bad answer because retrieval failed, ranking was off, compression dropped the key fact, or the injection format confused the model.
Stage 1: Retrieval
This is where you pull candidate information from one or more sources. Vector search against an embedding index is the most common pattern, but production systems almost always combine it with keyword search (BM25) in a hybrid approach. Tools like Pinecone, Weaviate, Qdrant, or pgvector handle the vector side. For hybrid search, LlamaIndex and LangChain both offer built-in retrievers that merge dense and sparse results.
The retrieval stage should be intentionally over-inclusive. Pull 20 to 50 candidate chunks even if you only plan to use 5. The ranking stage will handle precision. Trying to be too precise at retrieval time means you will miss relevant documents that happen to have low cosine similarity but high semantic relevance.
Stage 2: Ranking
Raw retrieval scores are noisy. A dedicated reranking step takes your candidate set and reorders it by actual relevance to the query. Cohere Rerank, Jina Reranker, and cross-encoder models from Hugging Face are the go-to options here. The cost is minimal (Cohere charges roughly $1 per 1,000 rerank calls) and the quality improvement is dramatic. We have seen reranking improve answer accuracy by 15 to 25 percent in production systems, which is often a bigger lift than switching to a more expensive LLM.
Stage 3: Compression
Even after ranking, your top results may contain redundant information, boilerplate, or content that is tangentially relevant. Compression reduces the token count without losing the facts that matter. You can use extractive compression (pulling key sentences) or abstractive compression (using a smaller, cheaper model like Claude Haiku to summarize chunks before they go to the main model). LangChain's ContextualCompressionRetriever wraps this pattern nicely.
Stage 4: Assembly
This is where you combine compressed context from multiple sources into a single coherent prompt payload. We cover this in depth in the multi-source assembly section below.
Stage 5: Injection
How you format and position context inside the prompt matters more than most teams realize. We will cover injection patterns and context window management in their own sections. If you want deeper background on the retrieval and ranking stages specifically, our guide on RAG architecture goes into more detail on embedding strategies and chunking approaches.
Context Window Management: The Budget You Cannot Ignore
Every context pipeline operates under a hard constraint: the model's context window. Even with Claude's 200K token window or Gemini's million-token window, treating context length as unlimited is a recipe for degraded quality and ballooning costs. Longer prompts cost more, take longer to process, and research consistently shows that models perform worse on information buried in the middle of very long contexts (the "lost in the middle" problem documented by Liu et al.).
The right approach is to set a context budget for every request. At Kanopy Labs, we typically allocate the window like this:
- System prompt and instructions: 500 to 2,000 tokens. Keep this tight. If your system prompt is over 2,000 tokens, you are probably embedding business logic that belongs in code.
- Retrieved context: 3,000 to 8,000 tokens for most use cases. This is your ranked, compressed, assembled context payload.
- Conversation history: 2,000 to 4,000 tokens of recent turns. Summarize older turns rather than including them verbatim.
- Output buffer: Reserve at least 1,000 to 2,000 tokens for the model's response.
That means a typical request uses 8,000 to 16,000 tokens total, even when the model supports 200K. This is deliberate. Tighter context produces better answers and costs less. With Claude on the Anthropic API, every 1,000 input tokens costs $3 for Opus or $0.80 for Sonnet. At 10,000 requests per day, the difference between a 15K token prompt and a 50K token prompt is hundreds of dollars daily.
Implement hard token limits at each pipeline stage. Your retriever should have a max_tokens parameter. Your compressor should have a target output length. Your assembler should enforce a total budget and drop the lowest-ranked sources if the budget is exceeded. Never let a single stage consume the entire window. These are the kinds of constraints that feel annoying during development but save you from mysterious quality regressions in production.
For a deeper look at how context and memory interact across conversation turns, check out our guide on AI memory and context engineering.
Multi-Source Context Assembly: Where Most Pipelines Break
Real products do not pull context from a single source. A customer support agent might need the user's account data from a SQL database, relevant help articles from a vector index, the last three support tickets from an API, and the current conversation history. A financial analyst tool might combine SEC filings, earnings call transcripts, market data, and the user's portfolio. Assembling all of this into a coherent context payload is where most pipelines fall apart.
The first mistake teams make is treating all sources equally. They dump everything into the prompt in whatever order the async calls resolve. The model has no way to distinguish authoritative data from supplementary context. The fix is simple: use explicit source labels and priority ordering.
Here is what a well-structured assembled context looks like in practice:
- Structured data first. Account details, configuration, user profile. This is factual and the model should treat it as ground truth. Label it clearly: "User Account Data (authoritative)".
- Retrieved documents second. Ranked and compressed chunks from your knowledge base. Label each with its source and relevance score so the model can weigh them appropriately.
- Conversation history third. Recent turns, summarized older turns. The model needs this for coherence but it should not override factual data.
- Supplementary context last. Examples, style guides, edge case instructions. Useful but lowest priority if tokens are tight.
The second mistake is not handling source conflicts. What happens when your vector search returns a help article that contradicts the structured data from the database? You need a conflict resolution strategy, and the simplest one that works is: structured data wins. Always. If the database says the user's plan is "Enterprise" and a cached help article references "Pro" tier features, the model should trust the database. Encode this hierarchy directly in your system prompt.
On the implementation side, LangChain's EnsembleRetriever and LlamaIndex's QueryFusionRetriever both handle merging results from multiple sources. But neither handles the labeling or priority ordering for you. That logic lives in your assembly layer, which is typically 200 to 400 lines of straightforward code that formats everything according to your template.
Caching Strategies That Actually Save Money
Context pipelines have a dirty secret: most of the information they retrieve does not change between requests. If a user asks three follow-up questions about the same topic, you are running the same vector search, the same reranking, and the same compression three times. At scale, this burns through API budgets and adds latency that users notice.
There are three levels of caching you should implement, and each one targets a different part of the pipeline:
Level 1: Retrieval Cache
Cache the raw retrieval results keyed on a normalized version of the query. Redis works well here. Set a TTL of 5 to 15 minutes for most use cases. If your knowledge base updates infrequently (documentation, help articles, policy documents), you can push the TTL to an hour or more. This eliminates redundant vector searches, which are your most expensive retrieval operation. A typical Pinecone query costs fractions of a cent, but at 100,000 queries per day, it adds up.
Level 2: Prompt Cache
Anthropic's prompt caching feature is a game-changer for context pipelines. If your system prompt and context payload share a common prefix across requests (and they usually do), the cached tokens cost 90% less and process 85% faster. Structure your prompts so that the static parts (system instructions, base context) come first, and the variable parts (user query, conversation-specific data) come last. This maximizes cache hit rates. We have seen teams cut their Claude API costs by 40 to 60 percent just by restructuring prompt order to take advantage of caching.
Level 3: Semantic Cache
This is the most advanced layer. Instead of exact-match caching, you embed incoming queries and check whether a semantically similar query has been answered recently. If the cosine similarity exceeds a threshold (typically 0.95 or higher), you return the cached response. GPTCache and LangChain's caching integrations support this pattern. The risk is returning stale or slightly wrong answers, so use a high similarity threshold and short TTLs. We recommend this only for high-volume, low-stakes use cases like FAQ bots or search suggestions.
The combined effect of all three layers is significant. For a typical SaaS product handling 50,000 context pipeline runs per day, we have seen total LLM costs drop from $800/day to $250/day after implementing retrieval and prompt caching alone. If you are building a system that manages prompts across many users and use cases, our guide on building a prompt management system covers how to structure prompt templates for maximum cache efficiency.
Evaluation and Quality Metrics: Know When Your Pipeline Is Broken
Building the pipeline is only half the job. If you cannot measure whether it is working, you are flying blind. Most teams track end-to-end metrics like "did the user thumbs-up the response" and call it a day. That is not enough. When quality drops, you need to know which stage failed, and end-to-end metrics do not tell you that.
Here are the stage-level metrics we track in every production pipeline:
- Retrieval recall: Of the documents that should have been retrieved for a given query, how many actually were? Measure this against a golden test set of 50 to 100 query-document pairs. If recall drops below 80%, your embedding model or chunking strategy needs work.
- Reranking precision at K: After reranking, are the top K results actually the most relevant? Use the same golden set. Precision at 5 should be above 70% for most domains.
- Compression fidelity: Does the compressed context preserve the key facts from the original chunks? Use an LLM-as-judge to compare compressed vs. original on a sample of requests. Flag cases where compression drops critical information.
- Context utilization: What percentage of injected context tokens does the model actually reference in its response? Low utilization (below 30%) suggests you are injecting too much irrelevant context. High utilization (above 80%) suggests the model is using everything, which could mean your retrieval is too narrow.
- Answer groundedness: Can every claim in the model's response be traced back to the provided context? Tools like Ragas, DeepEval, and Galileo automate this check. Groundedness below 85% is a red flag for hallucination.
Run these metrics on every deployment, not just during development. Set up automated evaluation suites using Ragas or DeepEval that run against your golden test set in CI/CD. When a metric drops below its threshold, block the deployment. This sounds aggressive, but it is far cheaper than shipping a broken pipeline to production and dealing with user complaints for three days before someone notices.
For the LLM-as-judge evaluations, Claude Sonnet is our go-to. It is accurate enough for quality judgments and costs roughly $0.003 per evaluation call. Budget $50 to $100 per month for automated evaluation across a typical pipeline.
Production Patterns and Getting Started
Let us get practical. Here are the production patterns that separate a pipeline that works in staging from one that holds up under real traffic.
Pattern 1: Graceful Degradation
Your vector database will go down. Your reranker API will time out. Your cache will get evicted. Design every stage with a fallback. If the reranker fails, fall back to raw retrieval scores. If the vector search is slow, fall back to keyword search. If caching is unavailable, proceed without it. Never let a single stage failure crash the entire pipeline. Implement circuit breakers using libraries like pybreaker (Python) or opossum (Node.js) so that repeated failures in one stage trigger the fallback automatically.
Pattern 2: Async Parallel Retrieval
When you are pulling context from multiple sources (database, vector index, API), run all retrievals in parallel. This seems obvious, but we regularly audit codebases where multi-source retrieval is sequential. A pipeline that queries a database (50ms), a vector index (200ms), and an external API (300ms) sequentially takes 550ms. In parallel, it takes 300ms. At the scale of a real product, those 250ms matter for user experience.
Pattern 3: Context Versioning
Log the exact context payload that was sent to the model for every request. When users report bad answers, you need to see exactly what the model saw. Store context snapshots in a structured log (we use BigQuery, but any analytics warehouse works) with the request ID, timestamp, context sources, token counts per source, and the model's response. This turns debugging from "the AI said something wrong" into "the retriever pulled the wrong document because the embedding for chunk 47 drifted after the last re-index."
Pattern 4: Progressive Context Loading
For conversational products, do not load all context upfront. Start with minimal context (system prompt plus user query), generate an initial response or clarifying question, then load additional context only when needed. This reduces latency for simple queries and saves tokens on interactions that do not require deep retrieval. Claude's tool use capability makes this pattern natural: the model can decide when it needs more context and call a retrieval tool on demand.
Getting Started: Timeline and Costs
A minimal context pipeline (retrieval, basic ranking, injection) takes a senior engineer about two weeks to build and costs $200 to $500 per month to run at moderate scale (10,000 to 50,000 requests per day). Adding compression, caching, and evaluation brings the build time to four to six weeks and the operating cost to $500 to $1,500 per month, depending on your LLM provider and traffic volume. The ROI is clear: teams that invest in context engineering see 20 to 40 percent improvements in answer quality and 30 to 50 percent reductions in LLM costs compared to naive approaches.
If you are building an AI product and struggling with answer quality, hallucination, or LLM costs, the context pipeline is almost certainly where the biggest improvements are hiding. We have helped dozens of teams design and ship these systems. Book a free strategy call and we will walk through your specific architecture and show you where the quick wins are.
Need help building this?
Our team has launched 50+ products for startups and ambitious brands. Let's talk about your project.