Why LLM Observability Is a Hard Problem
You shipped an AI feature. Users love it. Then three weeks later, support tickets start rolling in: "the AI gave me wrong information," "it keeps repeating itself," "it charged me $47 for one query." You open your Datadog dashboard and see... nothing useful. HTTP 200s across the board. Latency looks fine. No errors in Sentry.
This is the core problem with LLM applications. They fail silently. A hallucinated answer returns a 200 status code just like a correct one. A prompt regression after a model update produces grammatically perfect garbage. A runaway agent loop burns tokens without triggering any traditional alert. You need specialized tooling that understands prompts, completions, token economics, and output quality.
Three platforms have emerged as serious contenders in this space: LangSmith (from the LangChain team), Braintrust (evaluation-first approach), and Arize with its open-source Phoenix library. Each takes a fundamentally different angle on the same problem, and picking the wrong one for your workflow can create months of integration pain.
We have deployed all three across different client projects. This is not a feature matrix copy-pasted from marketing pages. It is a practitioner's comparison based on real production workloads, real cost numbers, and real developer experience feedback from teams shipping AI products daily.
LangSmith: Deep Integration, LangChain Lock-in
LangSmith is the observability and evaluation platform built by the LangChain team. If you are already using LangChain or LangGraph for your agent orchestration, LangSmith is the path of least resistance. Tracing is automatic. Every chain step, retriever call, tool invocation, and LLM request gets captured with zero additional instrumentation code. You set an environment variable and your entire execution graph appears in the dashboard.
The trace visualization is best-in-class for complex chains. You can see exactly where a multi-step agent spent its time, which tool calls succeeded or failed, what context was retrieved, and how the final response was assembled. For debugging a RAG pipeline that returned irrelevant documents or an agent that called the wrong tool, this level of granularity is invaluable.
Pricing and access. LangSmith runs $39 per seat per month on the Plus plan, with usage-based pricing for traces beyond the included allotment. The Developer tier is free for up to 5,000 traces per month, which is enough for prototyping but not production. Enterprise pricing is custom and includes SSO, role-based access, and dedicated support.
Evaluation capabilities. LangSmith lets you create datasets of input/output pairs, run evaluations against them, and track results over time. You can use LLM-as-judge scoring, custom Python evaluators, or human annotation queues. The prompt playground lets you iterate on prompts and immediately run them against your test datasets to catch regressions before deploying.
Where it falls short. The biggest limitation is ecosystem coupling. If you are not using LangChain, instrumentation requires their SDK and manual span creation. It works, but you lose the "automatic everything" magic. The cost tracking exists but is basic compared to dedicated cost optimization tools. And the UI, while powerful, has a learning curve that frustrates developers who just want to search their logs quickly.
For teams already invested in the LangChain ecosystem running LangGraph agents, LangSmith is the obvious choice. The integration depth is unmatched. For everyone else, the value proposition gets murkier fast.
Braintrust: Evaluation as the Core Primitive
Braintrust approaches LLM observability from the opposite direction. Instead of starting with tracing and adding evaluation on top, Braintrust treats evaluation as the fundamental unit of work. Everything else, including logging, tracing, and the prompt playground, exists to support the evaluation workflow.
This philosophy shows up in the product design. When you open Braintrust, the first thing you see is your experiments: side-by-side comparisons of prompt versions, model configurations, or pipeline changes scored against your golden datasets. The interface makes it trivially easy to compare two configurations and see exactly where one beats the other.
AI-assisted scoring. Braintrust's standout feature is its approach to automated evaluation. Their scoring system uses LLM judges with carefully tuned prompts, but it also learns from your corrections. When you disagree with an automated score, that feedback improves future scoring. Over time, your evaluation suite gets more accurate without you writing more code. This is genuinely useful for teams that cannot afford to manually review every output but need better signal than simple string matching.
Pricing model. Braintrust charges $25 per seat per month plus usage-based pricing for evaluation runs and logged traces. The usage component scales with how many experiments you run and how much data you store. For teams running frequent evaluations across large datasets, this can add up, but the per-seat cost is lower than LangSmith.
Prompt playground and proxy. The Braintrust proxy is clever. It sits between your application and your LLM provider, capturing all requests automatically while also enabling features like caching, rate limiting, and model fallbacks. You get observability as a side effect of routing your traffic through their infrastructure. The playground integrates with your logged data, so you can replay production requests with modified prompts and immediately see how scores change.
Data management. Braintrust treats your evaluation datasets as first-class objects. You can version them, share them across projects, filter by metadata, and use them as regression test suites in CI/CD. This is where the evaluation-first philosophy pays off. Teams that treat prompt engineering as a data science workflow (hypothesis, experiment, measure, iterate) find that Braintrust's model maps closely to how they already think.
The tracing UI exists and works fine, but it is not as detailed or visually rich as LangSmith's chain visualization. If your primary need is debugging complex agent interactions, Braintrust will feel like it is optimized for the wrong workflow.
Arize Phoenix: Open Source Foundation, Enterprise Scale
Arize takes a third approach entirely. Their open-source Phoenix library gives you local-first LLM observability with no vendor dependency. You can run Phoenix on your laptop, trace your LLM calls, visualize embeddings, and run evaluations without sending any data to a third-party server. For teams with strict data residency requirements or those who refuse to pipe production prompts through external services, Phoenix is the most direct starting point of the three.
Phoenix is built on OpenTelemetry, which means your traces are portable. If you later decide to switch to a different backend, your instrumentation code stays the same. This is a meaningful architectural advantage over LangSmith and Braintrust, which use proprietary SDKs that create switching costs.
Embeddings analysis and drift detection. Arize's heritage is in ML observability (tabular models, computer vision, NLP classifiers), and that expertise shows in their embeddings tooling. You can visualize your RAG retrieval embeddings in a UMAP projection, identify clusters of queries that retrieve poorly, and detect when your embedding distributions drift from your training data. Neither LangSmith nor Braintrust offers anything comparable for understanding retrieval quality at a geometric level.
Enterprise Arize platform. The commercial Arize platform adds managed infrastructure, team collaboration, alerting, dashboards, and integrations with your existing ML monitoring stack. If your organization already uses Arize for traditional ML model monitoring (classification models, recommendation systems, fraud detection), adding LLM observability to the same platform creates a unified view across your entire AI portfolio. Enterprise pricing is custom but typically starts around $500/month for small teams and scales with data volume.
Framework compatibility. Phoenix integrates with LangChain, LlamaIndex, OpenAI's SDK, Anthropic's SDK, and any OpenTelemetry-compatible framework. The instrumentation is lightweight: a few lines of code to attach the Phoenix tracer, and all your LLM calls get captured automatically. This framework-agnostic approach makes Phoenix the best fit for teams using custom orchestration or multiple frameworks across different services.
Limitations. The open-source version lacks collaboration features, alerting, and long-term storage. You are running a local server, not a production SaaS. The evaluation tooling is functional but less polished than Braintrust's experiment-comparison interface. And the documentation, while improving, still assumes familiarity with ML observability concepts that pure software engineers might not have.
Tracing and Cost Tracking: Feature-by-Feature Breakdown
Tracing is the foundation of LLM observability. Without span-level traces showing exactly what happened during a request, you are debugging with print statements. All three platforms support tracing, but the implementation details matter more than the checkbox.
LangSmith tracing. Captures full execution graphs with parent-child span relationships. Every LangChain component (prompt template, retriever, tool, LLM call, output parser) gets its own span with timing, inputs, and outputs. Non-LangChain code requires manual span creation using their SDK's @traceable decorator or context managers. Token counts are captured per-LLM-call with cost calculated based on the model's published pricing. The latency breakdown shows time-to-first-token and total generation time separately.
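To make the parent-child span idea concrete, here is a minimal sketch in plain Python. It is not LangSmith's SDK: the `traced` decorator, the `Span` class, and the `ROOTS` list are hypothetical names we made up to show how a decorator can record each call as a span nested under its caller, with inputs, output, and timing attached.

```python
import time
from contextvars import ContextVar
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Callable


@dataclass
class Span:
    name: str
    inputs: dict
    output: Any = None
    duration_s: float = 0.0
    children: list = field(default_factory=list)


# The span currently executing, so nested calls know where to attach.
_current = ContextVar("current_span", default=None)
# Top-level spans, one per traced request.
ROOTS: list = []


def traced(fn: Callable) -> Callable:
    """Hypothetical decorator: record each call as a span under its caller."""

    @wraps(fn)
    def wrapper(*args, **kwargs):
        span = Span(name=fn.__name__, inputs={"args": args, "kwargs": kwargs})
        parent = _current.get()
        (parent.children if parent is not None else ROOTS).append(span)
        token = _current.set(span)
        start = time.perf_counter()
        try:
            span.output = fn(*args, **kwargs)
            return span.output
        finally:
            # Always record timing and restore the parent, even on errors.
            span.duration_s = time.perf_counter() - start
            _current.reset(token)

    return wrapper


@traced
def retrieve(query: str) -> list:
    return [f"doc about {query}"]  # stand-in for a retriever call


@traced
def generate(query: str, docs: list) -> str:
    return f"answer to {query!r} using {len(docs)} doc(s)"  # stand-in for an LLM call


@traced
def answer(query: str) -> str:
    return generate(query, retrieve(query))
```

Calling `answer("pricing")` leaves one root span named `answer` in `ROOTS` with two children, `retrieve` and `generate`, in call order. A real platform adds export, sampling, and a UI on top, but the span tree underneath has roughly this shape.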
Braintrust tracing. The proxy-based approach captures requests at the HTTP level, which means you get traces automatically for any provider that goes through their proxy. Span nesting is supported for custom logic via their SDK. Token counts and costs are calculated automatically, with the ability to set budget alerts per project or environment. The cost tracking includes a spend-over-time view that is genuinely useful for catching unexpected spikes before they blow your budget.
Arize Phoenix tracing. Built on OpenTelemetry spans, which means standard tooling works. You can export traces to Jaeger, Zipkin, or any OTLP-compatible backend in addition to the Phoenix UI. Token counting works for major providers. Cost tracking requires configuration of your negotiated pricing (especially relevant for enterprise contracts where per-token costs differ from published rates). Drift detection on trace metadata lets you alert when request patterns change significantly.
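The negotiated-pricing point is easy to picture as a lookup with an override table. The sketch below is not Phoenix configuration; the model names, the `Price` class, and every number in it are made up for illustration. It only shows the arithmetic: token counts times a per-million-token rate, preferring the contract rate when one exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens, split by direction."""
    input_per_mtok: float
    output_per_mtok: float


# Illustrative numbers only, not any provider's real price list.
LIST_PRICES = {
    "model-small": Price(input_per_mtok=0.50, output_per_mtok=1.50),
    "model-large": Price(input_per_mtok=3.00, output_per_mtok=15.00),
}

# Contract rates that replace the list price where they exist (also made up).
NEGOTIATED = {
    "model-large": Price(input_per_mtok=2.40, output_per_mtok=12.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD, preferring the negotiated rate if present."""
    price = NEGOTIATED.get(model) or LIST_PRICES[model]
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000
```

An unknown model raises `KeyError` here on purpose: silently pricing a request at zero is the kind of gap that makes a cost dashboard look healthier than it is.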
Cost optimization recommendations. This is where the platforms diverge sharply. LangSmith shows you cost data but does not suggest optimizations. Braintrust's experiment framework naturally surfaces cost differences when you A/B test models (try claude-3-haiku vs claude-sonnet-4-5 and see the cost/quality tradeoff immediately). Arize's enterprise platform includes cost anomaly detection that flags requests costing more than a configurable threshold. None of them automatically optimize your token usage, but Braintrust gets closest by making cost a first-class metric in every experiment.
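Treating cost as a first-class metric can be as simple as reporting it next to the quality score for every variant. The following is a rough sketch under our own assumptions, not Braintrust's API: `Run`, `summarize`, and the "dollars per score point" ratio are hypothetical names and a metric we chose for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    variant: str     # a model or prompt version under test
    score: float     # quality score in [0, 1] from whatever scorer you use
    cost_usd: float  # cost of that single request


def summarize(runs):
    """Group runs by variant and report mean score, mean cost, and cost per point."""
    by_variant = {}
    for run in runs:
        by_variant.setdefault(run.variant, []).append(run)

    summary = {}
    for variant, group in by_variant.items():
        avg_score = mean(r.score for r in group)
        avg_cost = mean(r.cost_usd for r in group)
        summary[variant] = {
            "mean_score": avg_score,
            "mean_cost_usd": avg_cost,
            # Dollars paid per unit of quality; lower is better.
            "usd_per_score_point": avg_cost / avg_score if avg_score else float("inf"),
        }
    return summary
```

A single ratio hides a lot, so read it alongside the raw score: a cheap variant that fails your quality bar is not a bargain.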
For a comprehensive look at what to track and how to build your observability pipeline, see our AI observability guide, which covers the full production stack.
Evaluation Workflows: Where the Real Differentiation Lives
Tracing tells you what happened. Evaluation tells you whether it was good. This is where the three platforms differ most, and where your choice of tool has the biggest impact on your team's velocity.
LangSmith evaluation. You create datasets (collections of input/expected-output pairs), define evaluators (Python functions, LLM judges, or human annotators), and run your chain against the dataset. Results are tracked over time so you can see if a prompt change improved or regressed quality. The human annotation queue is well-designed for teams that need subject matter experts to grade outputs. The integration with the LangSmith playground means you can go from "this trace looks wrong" to "let me try a different prompt" to "run it against my test suite" without leaving the UI.
Braintrust evaluation. This is Braintrust's home turf. Experiments are the core abstraction: you define a task, a dataset, and scoring functions, then run variations. The comparison view shows you exactly which examples improved, which regressed, and by how much. AI-assisted scoring means you can stand up a quality evaluation in minutes rather than days. The scoring learns from your corrections, which solves the cold-start problem of "I do not have enough labeled data to build an eval suite." You can run evaluations in CI/CD to gate deployments on quality thresholds, which is critical for teams shipping weekly.
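The task, dataset, and scorer triple, plus a CI gate, fits in a few lines of plain Python. This is a sketch of the pattern rather than Braintrust's SDK: `run_eval`, `gate`, `exact_match`, the toy dataset, and the threshold values are all our own illustrative choices.

```python
from statistics import mean


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task, dataset, scorer) -> float:
    """Run `task` on every example and return the mean score."""
    return mean(scorer(task(ex["input"]), ex["expected"]) for ex in dataset)


def gate(score: float, threshold: float) -> int:
    """Exit code for a CI step: 0 lets the deploy proceed, 1 blocks it."""
    return 0 if score >= threshold else 1


# Hypothetical golden dataset; a real one comes from your own product.
DATASET = [
    {"input": "capital of France", "expected": "Paris"},
    {"input": "capital of Japan", "expected": "Tokyo"},
    {"input": "capital of Peru", "expected": "Lima"},
    {"input": "capital of Kenya", "expected": "Nairobi"},
]

# Canned answers standing in for a model; one question is deliberately missing.
_ANSWERS = {
    "capital of France": "paris",
    "capital of Japan": "Tokyo ",
    "capital of Peru": "Lima",
}


def toy_task(question: str) -> str:
    """Stand-in for your LLM pipeline."""
    return _ANSWERS.get(question, "I don't know")
```

In a CI job you would end the script with something like `sys.exit(gate(run_eval(toy_task, DATASET, exact_match), threshold=0.9))` so that a score below the bar fails the build.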
Arize Phoenix evaluation. Phoenix supports LLM-as-judge evaluation with pre-built templates for common tasks (relevance, coherence, toxicity, hallucination detection). The evaluation framework is modular: you can mix automated scoring with retrieval metrics (NDCG, precision@k) for RAG applications. The open-source nature means you can customize evaluation logic without limits, but you also do not get the polished experiment-comparison UI that Braintrust provides out of the box.
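For readers who have not met the retrieval metrics named above, here is a minimal sketch of precision@k and NDCG@k using the standard textbook definitions. These are our own helper functions, not Phoenix's implementation, and the function names are illustrative.

```python
import math


def precision_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def dcg(gains) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(retrieved, relevance, k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking.

    `relevance` maps doc id -> graded relevance; missing ids count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Precision@k only asks whether relevant documents made the cut, while NDCG also rewards putting the most relevant ones first. That makes NDCG the more telling number when your prompt only has room for the top few chunks.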
For teams building serious evaluation pipelines, our guide on LLM evaluation strategies covers the methodology side: how to pick metrics, build golden datasets, and avoid the common pitfalls of LLM-as-judge scoring.
Regression detection. All three platforms support tracking scores over time, but the implementation varies. LangSmith shows a time-series chart of evaluation results. Braintrust compares experiments explicitly (version A vs version B with statistical significance). Arize alerts when scores drift below a threshold. For teams deploying frequently, Braintrust's experiment-gating approach catches regressions before they reach production, which beats detecting them after users have already seen the bad output.
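The explicit version-A-versus-version-B comparison boils down to pairing scores by example and looking at the direction of change. The sketch below is a simplified illustration with a hypothetical `compare_experiments` helper; it does no significance testing, which a real platform or a proper statistical test would layer on top.

```python
def compare_experiments(baseline: dict, candidate: dict, tolerance: float = 0.0):
    """Pair scores by example id and bucket each example by direction of change.

    Both arguments map example id -> score. Only ids present in both runs
    are compared, so added or removed examples do not skew the result.
    """
    improved, regressed, unchanged = [], [], []
    for example_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[example_id] - baseline[example_id]
        if delta > tolerance:
            improved.append(example_id)
        elif delta < -tolerance:
            regressed.append(example_id)
        else:
            unchanged.append(example_id)
    return {"improved": improved, "regressed": regressed, "unchanged": unchanged}
```

The per-example lists matter more than the averages. A candidate whose mean score is flat can still have broken a handful of important cases while fixing others, and only the paired view shows that.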
Integration and Developer Experience
The best observability platform is the one your team actually uses. Developer experience, SDK quality, and integration friction determine adoption more than any feature comparison matrix.
LangSmith SDK. If you use LangChain, integration is one environment variable (LANGCHAIN_TRACING_V2=true) alongside your LangSmith API key. Every chain, agent, and tool call gets traced automatically. For non-LangChain code, you use the langsmith Python SDK with decorators or context managers. The SDK is well-maintained, typed, and documented. TypeScript support exists and is functional but lags behind Python. The main friction point: if you are not using LangChain, you are writing more boilerplate than you would with Braintrust's proxy approach.
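Because tracing that is switched on by environment variable also fails silently when the variable is missing, a small preflight check at startup can help. This is a sketch of our own, not part of the langsmith SDK: `tracing_config_problems` is a hypothetical helper, and the `LANGCHAIN_API_KEY` variable name is an assumption on our part that you should verify against the current LangSmith documentation.

```python
import os


def tracing_config_problems(env=None) -> list:
    """Return human-readable problems with the tracing environment, if any."""
    env = os.environ if env is None else env
    problems = []
    if env.get("LANGCHAIN_TRACING_V2", "").lower() != "true":
        problems.append("LANGCHAIN_TRACING_V2 is not set to 'true'")
    # Assumption: the API key is read from LANGCHAIN_API_KEY.
    if not env.get("LANGCHAIN_API_KEY"):
        problems.append("LANGCHAIN_API_KEY is missing")
    return problems
```

Logging the returned list at service start, or failing the health check when it is non-empty, turns a quiet gap in your traces into something you notice on deploy day.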
Braintrust SDK and proxy. The proxy approach is elegant for tracing: point your OpenAI/Anthropic client at the Braintrust proxy URL, and everything gets captured. No SDK needed for basic tracing. For evaluations, the Braintrust SDK (Python and TypeScript) provides a clean API for defining experiments, datasets, and scorers. The DX is good. The documentation is thorough with realistic examples. The main pain point: the proxy adds a network hop, which adds latency (typically 10-30ms). For latency-sensitive applications, this matters.
Arize Phoenix SDK. Phoenix uses OpenInference, an extension of OpenTelemetry for AI. Instrumentation packages exist for LangChain, LlamaIndex, OpenAI, Anthropic, and more. Setup is a few lines of code to register the tracer provider. Because it is OpenTelemetry under the hood, you can use standard OTLP exporters to send data wherever you want. The DX is solid for Python developers familiar with the observability ecosystem. The main drawback: running the Phoenix server locally adds operational overhead that the SaaS platforms handle for you.
CI/CD integration. Braintrust has the strongest CI/CD story with native GitHub Actions support for running evaluations on pull requests. LangSmith supports CI evaluations through their SDK but requires more custom scripting. Phoenix evaluations can run anywhere Python runs, but you are responsible for the infrastructure to store and compare results.
Multi-language support. All three have strong Python SDKs. TypeScript support is best in Braintrust, good in LangSmith, and emerging in Phoenix. If your backend is Go, Rust, or Java, Phoenix's OpenTelemetry foundation gives you the best path forward since OTLP exporters exist for every major language.
Which Platform Fits Your Team
After deploying all three across different projects, the recommendation matrix is clear. Your choice depends on three factors: your orchestration framework, your primary workflow (debugging vs evaluation), and your data residency requirements.
Choose LangSmith if: You are building with LangChain or LangGraph and want zero-friction observability. Your primary pain is debugging complex agent chains. You need human annotation workflows for domain experts to grade outputs. You are comfortable with $39/seat/month and vendor coupling to the LangChain ecosystem. Teams using LangSmith typically get value within hours because the automatic instrumentation captures everything immediately.
Choose Braintrust if: Your primary workflow is iterating on prompts and measuring quality. You run frequent A/B tests on model configurations, prompt templates, or retrieval strategies. You want CI/CD-gated deployments that block on quality regressions. You value the proxy approach for low-friction tracing across any LLM provider. At $25/seat plus usage, it is the most cost-effective option for evaluation-heavy teams. Braintrust works best when you already have a hypothesis-driven development culture.
Choose Arize Phoenix if: You have strict data residency requirements and cannot send prompts to a third party. You already use Arize for traditional ML monitoring and want a unified platform. You want OpenTelemetry-native instrumentation that avoids vendor lock-in. You need embeddings analysis and drift detection for RAG quality monitoring. Your team includes ML engineers who are comfortable running infrastructure. Phoenix is free to start and the enterprise platform scales with you.
The hybrid approach. Some teams use Braintrust for evaluation and CI/CD quality gating while running Phoenix locally for development-time debugging. Others use LangSmith for production tracing but run Braintrust experiments for prompt optimization. These combinations work because the tools solve different primary problems. Do not feel locked into choosing exactly one.
What we recommend to clients. For most product-focused teams shipping AI features (not ML research labs), Braintrust delivers the fastest path to measurable quality improvement. The evaluation-first workflow forces you to define "good" before you ship, which prevents the most common failure mode: deploying prompts that feel right but have not been rigorously tested. For infrastructure-heavy teams with existing observability stacks, Phoenix's OpenTelemetry foundation integrates cleanly without adding another SaaS to your vendor list.
Regardless of which platform you pick, the critical step is picking one and instrumenting your LLM calls before you have a production incident that you cannot debug. Every week you run without observability is a week where silent quality degradation compounds unchecked.
If you are evaluating LLM observability platforms for a production deployment and want help choosing the right stack for your architecture, book a free strategy call with our AI engineering team. We will map your requirements to the platform that fits and help you avoid the integration pitfalls we have already navigated.