What Makes a Research Agent Different from a Chatbot
A chatbot answers questions from its training data or a fixed knowledge base. A research agent actively searches for information, evaluates sources, synthesizes findings, and presents results with citations. The difference is agency: the ability to take actions (search the web, read documents, run calculations) in pursuit of an answer.
Perplexity proved the model: take a user query, decompose it into sub-questions, search multiple sources, verify and cross-reference facts, and synthesize a cited answer. The result is dramatically better than either a search engine (which gives you links to read) or a chatbot (which gives you answers without sources).
The vertical opportunity is enormous. A general-purpose research agent competes with Perplexity and Google. A research agent specialized for legal due diligence, medical literature review, financial analysis, or academic research serves users who will pay $50 to $500/month for a tool that saves them hours of manual research per week.
If you have worked with RAG architectures, think of a research agent as RAG extended with an agent loop that decides what to search for, evaluates the results, and iterates until the question is answered satisfactorily.
The Agent Loop Architecture
The core of a research agent is the agent loop: a cycle of thinking, acting, observing, and deciding whether to continue or deliver a final answer.
Step 1: Query Analysis
The agent receives a user query and decides how to approach it. For simple factual questions ("What is the population of Tokyo?"), a single search suffices. For complex research questions ("What are the trade-offs between graph RAG and vector RAG for legal document retrieval?"), the agent decomposes the query into sub-questions and plans a research strategy.
Step 2: Tool Selection and Execution
The agent chooses which tools to use: web search (Brave Search, Serper, or Tavily API), document retrieval (search your internal knowledge base), code execution (for calculations or data analysis), and specialized APIs (academic paper search via Semantic Scholar, patent search via USPTO, financial data via Alpha Vantage). Each tool call returns results that the agent processes.
Step 3: Result Evaluation
The agent evaluates whether the tool results adequately answer the query. If not, it decides what to search next. This is where the "agent" part matters: instead of a single retrieve-and-generate step, the agent iterates. It might search for "graph RAG benefits," realize it needs more on "limitations of graph RAG at scale," and issue a follow-up search.
Step 4: Synthesis and Citation
Once the agent has gathered enough information, it synthesizes a response. Every factual claim must cite a specific source with a URL or document reference. The synthesis step is a carefully prompted LLM call that produces structured output: answer text with inline citations, a reference list, and confidence indicators for each claim.
Step 5: Quality Check
Before delivering the final answer, the agent runs a verification step: do all citations point to real sources? Does the answer actually address the original query? Are there contradictions between cited sources? If any check fails, the agent loops back to gather more information.
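To make the loop concrete, here is a minimal sketch in Python. It assumes hypothetical call_llm, run_tool, and passes_quality_check helpers standing in for your LLM client, tool registry, and verification step; it illustrates the control flow rather than a production implementation.

```python
# Minimal agent-loop sketch. call_llm, run_tool, and passes_quality_check are
# hypothetical helpers standing in for your LLM client, tool registry, and
# verification step; they are not from any specific framework.
import json

MAX_STEPS = 8

def research(query: str) -> dict:
    notes = []  # accumulated tool results: {"source": url, "content": text}
    for _ in range(MAX_STEPS):
        # Steps 1-3: ask the model for the next action given the notes so far
        plan = json.loads(call_llm(
            system='Decide the next action. Reply as JSON: '
                   '{"action": "search" | "read" | "answer", "input": "..."}',
            user=json.dumps({"query": query, "notes": notes}),
        ))
        if plan["action"] == "answer":
            break
        notes.append(run_tool(plan["action"], plan["input"]))

    # Step 4: synthesize a cited answer from the gathered notes
    answer = call_llm(
        system="Answer the query using only the notes. Cite sources inline as [n].",
        user=json.dumps({"query": query, "notes": notes}),
    )
    # Step 5: flag the answer if the verification pass fails
    return {"answer": answer, "sources": notes,
            "verified": passes_quality_check(answer, notes)}
```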
Tool Use Implementation
Tools are what give your research agent capabilities beyond the LLM's training data. Here is how to implement the most important ones.
Web Search
Use Tavily ($0.01 per search, optimized for LLM consumption), Brave Search API ($3 per 1,000 queries), or Serper ($0.004 per search, Google results). Tavily is the best for agents because it returns clean, LLM-ready text extracts rather than raw HTML. For each search result, fetch and parse the full page content (using Firecrawl or a custom scraper) to get complete context beyond the search snippet.
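As a sketch, a search tool wrapper against Tavily's REST search endpoint might look like the following; the endpoint and field names are based on its documented API and should be checked against the current docs.

```python
# Web-search tool sketch using Tavily's REST API. Endpoint and field names
# follow its documented /search route; verify against the current docs.
import os
import requests

def web_search(query: str, max_results: int = 5) -> list[dict]:
    resp = requests.post(
        "https://api.tavily.com/search",
        json={
            "api_key": os.environ["TAVILY_API_KEY"],
            "query": query,
            "max_results": max_results,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Keep only what the agent needs: title, URL, and the extracted text
    return [
        {"title": r["title"], "url": r["url"], "content": r["content"]}
        for r in resp.json().get("results", [])
    ]
```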
Document Reading
When the agent finds a relevant URL, it needs to read the full page content. Use Firecrawl (best for LLM-ready markdown output), Jina Reader API (free, converts URLs to clean text), or a custom scraper with Playwright for JavaScript-rendered pages. Strip navigation, ads, and boilerplate to reduce token usage.
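A minimal page-reading tool using the Jina Reader URL prefix could look like this; the truncation length is an arbitrary choice to keep token usage predictable.

```python
# Page-reading tool sketch using the Jina Reader prefix (https://r.jina.ai/),
# which returns a clean text rendering of the target URL.
import requests

def read_page(url: str, max_chars: int = 20_000) -> str:
    resp = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    resp.raise_for_status()
    # Truncate so a single long page cannot blow up the context window
    return resp.text[:max_chars]
```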
Internal Knowledge Base
For vertical research agents, your proprietary knowledge base is the competitive advantage. Implement vector search over your curated corpus using pgvector or Pinecone. The agent should search both the web and your internal KB, with internal results weighted higher for domain-specific questions. Our AI search guide covers the retrieval layer in depth.
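A sketch of the internal KB search, assuming a Postgres documents table with an embedding vector column (pgvector) and a hypothetical embed() helper that returns a list of floats from your embedding model:

```python
# Internal-KB search sketch with pgvector. The documents table, its
# embedding column, and the embed() helper are assumptions for illustration.
import psycopg

def search_kb(query: str, k: int = 5) -> list[dict]:
    query_vec = embed(query)  # hypothetical embedding call
    with psycopg.connect("dbname=research") as conn:
        rows = conn.execute(
            """
            SELECT title, url, content
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (str(query_vec), k),
        ).fetchall()
    # Mark results as internal so the agent can weight them higher
    return [{"title": t, "url": u, "content": c, "internal": True}
            for t, u, c in rows]
```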
Code Execution
For research questions involving calculations, data analysis, or chart generation, give the agent a sandboxed Python environment (E2B, Modal, or a custom Docker container). The agent writes and executes code, then includes the results in its answer. This is essential for financial analysis, statistical queries, and data comparison tasks.
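As one possible approach, here is a sketch that runs agent-written code in a throwaway Docker container with no network access; a managed sandbox like E2B or Modal would replace this in production.

```python
# Sandboxed execution sketch using a disposable Docker container. Assumes the
# Docker CLI is available and the python:3.12-slim image has been pulled.
import subprocess

def run_python(code: str, timeout: int = 30) -> str:
    result = subprocess.run(
        ["docker", "run", "--rm", "--network=none", "--memory=512m",
         "python:3.12-slim", "python", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    # Return stdout plus stderr so the agent can see errors and retry
    return result.stdout + result.stderr
```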
Citation and Source Verification
Citations are what make a research agent trustworthy. Without them, you have a chatbot that confidently states things that might be wrong.
Citation Implementation
Use structured output to force citations. In your system prompt, instruct the model to output JSON with inline citation markers: "Graph RAG achieves 20-40% higher accuracy on multi-hop queries [1]." The reference list maps each marker to a source URL, title, and relevant excerpt. Parse the structured output and render citations as clickable links in your UI.
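A sketch of the citation schema, using pydantic to validate the model's structured output; the field names are illustrative, not a fixed spec.

```python
# Citation schema sketch using pydantic. The synthesis prompt asks the model
# to emit JSON matching this shape; validation catches malformed output.
from pydantic import BaseModel

class Source(BaseModel):
    id: int        # matches the [n] marker in the answer text
    url: str
    title: str
    excerpt: str   # the passage that supports the claim

class CitedAnswer(BaseModel):
    answer: str            # prose with inline markers like "... [1]"
    sources: list[Source]
    confidence: str        # e.g. "high" | "medium" | "low" per the prompt's rubric

def parse_answer(raw_json: str) -> CitedAnswer:
    return CitedAnswer.model_validate_json(raw_json)
```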
Source Quality Scoring
Not all sources are equal. Build a source quality scoring system: primary sources (government data, company filings, academic papers) score highest, reputable publications (major newspapers, peer-reviewed journals) score high, industry blogs and reports score medium, forums and social media score low. Weight the agent's synthesis toward higher-quality sources and flag answers that rely primarily on low-quality sources.
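One simple way to implement this is a domain-based tier table; the domains below are examples only, not a recommended list for any particular vertical.

```python
# Source-quality scoring sketch: a domain-tier lookup. Replace the example
# domains with a curated list for your own domain.
from urllib.parse import urlparse

QUALITY_TIERS = {
    1.0: (".gov", "sec.gov", "pubmed.ncbi.nlm.nih.gov"),   # primary sources
    0.8: ("reuters.com", "nature.com", "nytimes.com"),      # reputable publications
    0.5: ("techcrunch.com", "medium.com"),                  # industry blogs
    0.2: ("reddit.com", "x.com", "news.ycombinator.com"),   # forums, social media
}

def source_quality(url: str) -> float:
    host = urlparse(url).netloc.lower()
    for score, domains in QUALITY_TIERS.items():
        if any(host.endswith(d) for d in domains):
            return score
    return 0.4  # unknown sources default to below-medium
```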
Fact Verification
For critical factual claims, implement a verification step: after the agent generates an answer, extract key claims and search for corroborating sources. If a claim appears in only one source, flag it as unverified. If multiple independent sources agree, mark it as verified. This multi-source verification dramatically reduces hallucination in the final output.
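A sketch of the verification pass, reusing the web_search tool from earlier and a hypothetical extract_claims LLM call; here a claim counts as verified only when a domain outside the already-cited sources corroborates it.

```python
# Multi-source verification sketch. extract_claims is a hypothetical LLM call
# that returns short claim strings; web_search is the search tool above.
from urllib.parse import urlparse

def verify_claims(answer: str, cited_sources: list[dict]) -> list[dict]:
    cited_domains = {urlparse(s["url"]).netloc for s in cited_sources}
    report = []
    for claim in extract_claims(answer):
        hits = web_search(claim, max_results=3)   # corroboration search
        independent = {urlparse(h["url"]).netloc for h in hits} - cited_domains
        report.append({
            "claim": claim,
            "verified": len(independent) >= 1,    # at least one independent source agrees
        })
    return report
```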
Freshness Scoring
For time-sensitive queries ("What is the current state of AI regulation in the EU?"), penalize older sources. Extract publication dates from web pages and include recency as a factor in source ranking. Display the date of each cited source so users can assess freshness themselves.
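A simple way to express recency as a score is exponential decay on source age; the half-life below is an arbitrary default, assuming you have already extracted a publication date from the page.

```python
# Freshness scoring sketch: exponential decay on source age.
import math
from datetime import date

def freshness_score(published: date, half_life_days: int = 180) -> float:
    age_days = max((date.today() - published).days, 0)
    # 1.0 for today's sources, 0.5 at the half-life, approaching 0 with age
    return math.exp(-math.log(2) * age_days / half_life_days)
```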
Building for a Vertical: The Real Opportunity
General-purpose research agents compete with Perplexity (well-funded, massive scale) and Google (infinite resources). Vertical research agents compete with manual research processes and charge premium prices.
Legal Research Agent
Search case law (CourtListener, RECAP), statutes (govinfo.gov), and regulations. Understand legal citation formats (Bluebook). Return answers with proper legal citations and jurisdiction context. Target: law firms, corporate legal departments. Price: $200 to $500/user/month.
Medical Literature Research Agent
Search PubMed, clinical trial registries, and medical guidelines. Understand medical terminology and evidence levels (meta-analysis vs. case report). Return answers with study quality assessments. Target: healthcare organizations, pharma companies, clinicians. Price: $100 to $300/user/month.
Financial Research Agent
Search SEC filings, earnings transcripts, market data, and analyst reports. Run financial calculations (DCF models, ratio analysis) using the code execution tool. Return answers with specific data points and their sources. Target: investment firms, corporate finance teams. Price: $200 to $500/user/month.
Competitive Intelligence Agent
Monitor competitor websites, job postings, press releases, patent filings, and social media. Synthesize weekly competitive briefings. Alert on significant changes (new product launch, funding round, key hire). Target: product teams, strategy teams. Price: $100 to $300/user/month.
The vertical focus gives you two advantages: domain-specific data sources that general agents lack, and domain expertise in your prompts that produces more accurate, more useful answers. A legal research agent that understands Bluebook citations and case law hierarchy is worth 10x a generic agent to a lawyer.
Tech Stack and Production Considerations
LLM Selection
Claude Opus for complex reasoning and synthesis (the final answer generation). Claude Sonnet for query decomposition and tool selection (faster, cheaper, good enough for planning). GPT-4o as a fallback. For multi-agent systems, route different parts of the research pipeline to different models based on task complexity.
Agent Framework
LangGraph or CrewAI for the agent loop orchestration. These frameworks handle: tool registration and execution, conversation memory, step-by-step tracing, and error recovery. You can build the agent loop from scratch with direct API calls (simpler, more control) but frameworks save time on the orchestration plumbing.
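For orientation, a LangGraph skeleton of the loop might look like the following; the node functions are stubs and the exact API can differ between LangGraph versions, so treat this as a sketch rather than working orchestration code.

```python
# Agent-loop skeleton in LangGraph (a sketch; node bodies are stubs and API
# details may vary by version, so check the current LangGraph docs).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ResearchState(TypedDict):
    query: str
    notes: list
    answer: str
    answer_ready: bool

def plan(state): ...        # decompose the query, decide first searches
def search(state): ...      # run tools, append results to notes
def evaluate(state): ...    # decide whether the notes answer the query
def synthesize(state): ...  # produce the cited answer

def should_continue(state) -> str:
    return "synthesize" if state.get("answer_ready") else "search"

graph = StateGraph(ResearchState)
graph.add_node("plan", plan)
graph.add_node("search", search)
graph.add_node("evaluate", evaluate)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("plan")
graph.add_edge("plan", "search")
graph.add_edge("search", "evaluate")
graph.add_conditional_edges("evaluate", should_continue,
                            {"search": "search", "synthesize": "synthesize"})
graph.add_edge("synthesize", END)
app = graph.compile()
```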
Streaming and UX
Research agents take 10 to 60 seconds to produce a comprehensive answer. You cannot show a loading spinner for 60 seconds. Stream the agent's progress: "Searching for graph RAG benchmarks...", "Reading 4 sources...", "Verifying claims...", then stream the final answer token by token. This transparency keeps users engaged and builds trust in the research process.
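One way to implement this is server-sent events; the sketch below assumes a hypothetical research_steps generator that wraps the agent loop and yields status events followed by answer tokens.

```python
# Streaming sketch: push progress events, then the answer, over server-sent
# events with FastAPI. research_steps is a hypothetical generator wrapping
# the agent loop that yields dicts like {"type": "status", "text": "..."}.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/research")
def research_endpoint(q: str):
    def event_stream():
        for event in research_steps(q):
            yield f"data: {json.dumps(event)}\n\n"  # SSE framing for the UI
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```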
Cost Management
A single research query can invoke 5 to 15 tool calls and 3 to 5 LLM calls. At $0.01 per search and $0.03 to $0.15 per LLM call, a complex query costs $0.20 to $1.00. Budget $3 to $10 per user per day for active researchers. Implement query complexity estimation and set cost guardrails: simple questions get 3 tool calls max, complex ones get up to 15.
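A sketch of a complexity-based guardrail, using the rough per-call costs quoted above and a crude keyword heuristic; a small LLM classifier works better in practice.

```python
# Cost-guardrail sketch. The per-call figures are the rough numbers from the
# text, not measured values; the heuristic is illustrative only.
SEARCH_COST = 0.01
LLM_COST = 0.08   # midpoint of the $0.03-$0.15 range above

def tool_call_budget(query: str) -> int:
    # Crude complexity heuristic; replace with an LLM classifier in production
    complex_markers = ("compare", "trade-off", "versus", "analyze", "trend")
    is_complex = (len(query.split()) > 20
                  or any(m in query.lower() for m in complex_markers))
    return 15 if is_complex else 3

def estimated_cost(tool_calls: int, llm_calls: int) -> float:
    return tool_calls * SEARCH_COST + llm_calls * LLM_COST
```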
Evaluation, Quality, and Next Steps
Measuring research agent quality requires domain-specific evaluation because generic LLM benchmarks do not capture research-specific skills.
Evaluation Metrics
- Citation accuracy: Do cited sources actually support the claims? Human evaluation on 100+ queries.
- Coverage: Does the answer address all aspects of the question? Compare against expert-written reference answers.
- Hallucination rate: What percentage of claims are not supported by any cited source? Target: under 2%.
- Source quality: Are cited sources authoritative for the domain?
- Freshness: For time-sensitive queries, are the sources current?
- Answer completeness: Would a domain expert consider the answer sufficient? 1 to 5 rating scale.
Building Your Eval Suite
Create 200+ test queries spanning your target domain. For each query, prepare a reference answer with expected citations. Run your agent against this suite weekly and track metrics over time. Automated evaluation (using a judge LLM to score answers against references) scales better than pure human evaluation but should be calibrated against human judgments periodically.
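A sketch of an automated harness along these lines, assuming hypothetical run_agent and call_llm helpers and a test set of query/reference pairs:

```python
# Eval-harness sketch: score agent answers against references with a judge
# LLM. run_agent, call_llm, and the test-set format are assumptions; calibrate
# judge scores against human ratings periodically.
import json

def evaluate_suite(test_set: list[dict]) -> dict:
    scores = []
    for case in test_set:  # each case: {"query": ..., "reference": ...}
        result = run_agent(case["query"])
        judgement = json.loads(call_llm(
            system="Score the candidate answer against the reference on coverage, "
                   "citation accuracy, and hallucination. Reply as JSON: "
                   '{"coverage": 1-5, "citation_accuracy": 1-5, "hallucinated_claims": 0}',
            user=json.dumps({"reference": case["reference"],
                             "candidate": result["answer"],
                             "citations": result["sources"]}),
        ))
        scores.append(judgement)
    n = len(scores)
    return {
        "avg_coverage": sum(s["coverage"] for s in scores) / n,
        "avg_citation_accuracy": sum(s["citation_accuracy"] for s in scores) / n,
        "pct_answers_with_hallucinations": sum(s["hallucinated_claims"] > 0
                                               for s in scores) / n,
    }
```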
Launch Strategy
Start with a narrow domain and 50 beta users. Have them rate every answer and provide corrections. Use this feedback to improve your prompts, tool selection logic, and source quality scoring. Expand to adjacent domains only after achieving 85%+ satisfaction in your initial vertical.
Ready to build an AI research agent for your domain? Book a free strategy call and we will help you define the right vertical, data sources, and technical architecture for your research product.
Need help building this?
Our team has launched 50+ products for startups and ambitious brands. Let's talk about your project.