---
title: "LangSmith vs Arize vs WhyLabs: LLM Monitoring for Production 2026"
author: "Nate Laquis"
author_role: "Founder & CEO"
date: "2030-02-06"
category: "Technology"
tags:
  - LLM monitoring
  - LangSmith
  - Arize
  - WhyLabs
  - AI observability
excerpt: "73% of LLM apps in production still lack adequate monitoring. LangSmith, Arize, and WhyLabs each attack this problem from a different angle. Here is what matters when you are choosing between them."
reading_time: "14 min read"
canonical_url: "https://kanopylabs.com/blog/langsmith-vs-arize-vs-whylabs"
---

# LangSmith vs Arize vs WhyLabs: LLM Monitoring for Production 2026

## Why most LLM apps are flying blind in production

A recent survey of AI engineering teams found that 73% of LLM-powered applications in production lack adequate monitoring. That number is staggering, but it tracks with what I see in practice. Teams spend months fine-tuning prompts, building retrieval pipelines, and wiring up agent loops, then deploy the whole thing behind a single health-check endpoint that returns 200 as long as the process is alive. The model could be hallucinating wildly, latency could be spiking on 10% of requests, costs could be doubling week over week, and nobody would know until a user screenshots the garbage output and posts it on social media.

Traditional APM tools like Datadog and New Relic were not built for this. They can tell you that your /chat endpoint returned a 200 in 1.2 seconds. They cannot tell you that the response was factually wrong, that the retrieval step pulled irrelevant documents, that the model consumed 14,000 tokens when 3,000 would have sufficed, or that the embedding distribution has drifted since last Tuesday. LLM monitoring is a distinct discipline because the failure modes are semantic, not structural. A perfectly healthy HTTP response can contain a perfectly terrible answer.

Three platforms have emerged as serious contenders for this problem: LangSmith, Arize, and WhyLabs. Each grew out of a different lineage and carries a different philosophy about what "monitoring" means for generative AI. LangSmith comes from the LangChain ecosystem and treats tracing and prompt iteration as the core loop. Arize comes from traditional ML observability and extends its drift detection and embedding analysis to LLM workloads. WhyLabs comes from data quality and open-source profiling, betting that guardrails and statistical profiles are the right primitives. Choosing between them is less about feature checklists and more about which mental model matches how your team thinks about production quality.

![Analytics dashboard displaying real-time monitoring metrics and performance charts](https://images.unsplash.com/photo-1460925895917-afdab827c52f?w=800&q=80)

In this comparison, I will go deep on each platform's architecture, trace visualization, cost tracking, latency analysis, quality scoring, and data retention. I will also share the decision framework I use when helping teams pick a monitoring stack. If you want broader context on what production observability looks like across the full AI lifecycle, our guide on [AI observability for production](/blog/ai-observability-for-production) covers the territory.

## LangSmith: tracing and iteration for the LangChain ecosystem

LangSmith is built by LangChain, Inc. and was originally designed as the companion tool for teams building with LangChain and LangGraph. Over the past two years it has broadened its scope significantly, but the DNA is still visible: tracing is the central artifact, and the product is at its best when you are debugging, iterating, and evaluating LLM chains in a tight feedback loop.

The tracing model is excellent. Every LLM call, retrieval step, tool invocation, and chain execution is captured as a span within a trace. The visualizer renders these as a nested tree, and you can click into any span to see the full input, output, token counts, latency, and model parameters. For agent workflows built with LangGraph, the trace view shows each node in the graph, the state transitions, and the branching logic. This is genuinely hard to replicate outside the LangChain ecosystem, because LangSmith understands the framework's internal abstractions. If your orchestration layer is LangChain, the tracing you get is richer than what any generic tool can provide.

The prompt playground is another strength. You can pull a failing production trace into the playground, modify the prompt or the model parameters, and rerun it against the same inputs. You can also run it against a saved dataset and score the results using built-in evaluators or custom LLM-as-judge functions. This creates a workflow where debugging, experimentation, and evaluation happen in a single surface rather than scattered across notebooks, scripts, and dashboards. Teams that adopt this loop report significantly faster iteration cycles.

On the monitoring side, LangSmith provides dashboards for latency percentiles (p50, p95, p99), token usage, cost per trace, error rates, and feedback scores. You can filter by tags, metadata, or time ranges, and set up alerts when metrics breach thresholds. The cost tracking is accurate for OpenAI, Anthropic, Google, and most major providers, pulling from a maintained pricing catalog. For teams that need to attribute cost per customer or per feature, tagging traces with the right metadata gives you that breakdown.
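
To make that concrete, here is a minimal sketch of metadata tagging with the LangSmith Python SDK. The traceable decorator, wrap_openai helper, and langsmith_extra pattern are documented SDK features; the model choice, tag names, and metadata keys are my own illustrative values:

```python
import os

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "..."  # your LangSmith API key

# wrap_openai adds a child span for every completion made through this client.
client = wrap_openai(OpenAI())

@traceable(run_type="chain", name="support_answer", tags=["support-bot"])
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# langsmith_extra attaches per-call metadata, which the dashboard can then
# use to break down cost by customer or feature. Keys here are illustrative.
answer(
    "How do I reset my password?",
    langsmith_extra={"metadata": {"customer_id": "acme-corp", "feature": "faq"}},
)
```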

Where LangSmith is less strong is outside the LangChain world. The Python and TypeScript SDKs work with any LLM call, not just LangChain, but the experience is noticeably less polished. You lose the automatic instrumentation, the graph-aware tracing, and some of the playground integrations. If your orchestration is custom-built or uses a different framework like Haystack or LlamaIndex, LangSmith still works, but you are doing more manual instrumentation, and the value proposition weakens relative to the alternatives.

Pricing starts at $39 per seat per month for the Plus plan, with usage-based charges beyond the included traces. The free Developer plan includes 5,000 base traces per month. Enterprise pricing adds SSO, extended data retention, and dedicated support. Of the three tools here, LangSmith is the most affordable entry point for small teams, but costs can scale quickly for high-volume production workloads with millions of traces per month.

## Arize: ML observability extended to LLMs

Arize was an ML observability company long before the LLM wave. Its original product helped teams monitor traditional ML models in production: detecting feature drift, tracking prediction quality, identifying data quality issues, and surfacing model degradation before it became a business problem. When LLMs took over, Arize extended that infrastructure to cover generative AI workloads, and the result is a platform with unusual depth in areas that LangSmith and WhyLabs treat more lightly.

The standout capability is embedding drift detection. Arize can ingest the embeddings from your prompts and responses, project them into a lower-dimensional space using UMAP, and let you visually explore clusters of similar interactions. When the distribution shifts (users start asking questions your system was not designed for, or outputs start clustering differently than they did last week), Arize flags it. This is genuinely useful for catching subtle degradation that keyword-based monitoring would miss. A retrieval-augmented generation system might return technically relevant documents that are semantically off-target, and Arize's embedding analysis can surface that pattern before your quality scores tank.
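
Here is a rough sketch of what shipping embeddings to Arize looks like with its pandas SDK. The schema fields follow Arize's generative-LLM conventions, but constructor argument names vary across SDK versions (older releases use space_key rather than space_id), and the column names and vectors below are placeholders:

```python
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import (
    EmbeddingColumnNames, Environments, ModelTypes, Schema,
)

# One row per interaction; vectors come from whatever embedding model you use.
df = pd.DataFrame({
    "prediction_id": ["req-001"],
    "prompt_text": ["How do I rotate my API key?"],
    "prompt_vector": [[0.12, -0.03, 0.88]],   # placeholder embedding
    "response_text": ["Go to Settings, then API Keys..."],
    "response_vector": [[0.09, 0.01, 0.91]],  # placeholder embedding
})

# The schema tells Arize which columns hold text and which hold vectors.
schema = Schema(
    prediction_id_column_name="prediction_id",
    prompt_column_names=EmbeddingColumnNames(
        vector_column_name="prompt_vector", data_column_name="prompt_text",
    ),
    response_column_names=EmbeddingColumnNames(
        vector_column_name="response_vector", data_column_name="response_text",
    ),
)

client = Client(space_id="YOUR_SPACE_ID", api_key="YOUR_API_KEY")
client.log(
    dataframe=df,
    schema=schema,
    model_id="support-bot",
    model_version="v3",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.PRODUCTION,
)
```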

![Modern data center with rows of illuminated server racks](https://images.unsplash.com/photo-1558494949-ef010cbdcc31?w=800&q=80)

Arize also offers strong trace visualization through its Arize Phoenix product, which is open source and can run locally or in the cloud. Phoenix provides a trace viewer with nested spans, latency breakdowns, token counts, and retrieval metrics. It supports OpenTelemetry-based instrumentation, so you can feed it data from LangChain, LlamaIndex, or custom code without vendor lock-in on the SDK side. The relationship between Phoenix (open source, developer-focused) and the Arize platform (commercial, team-focused) is similar to the Langfuse model: start free and local, graduate to the hosted product when you need collaboration, alerting, and retention.
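
Getting a local Phoenix trace viewer running takes a few lines. This sketch follows the documented phoenix.otel pattern; the project name and model are illustrative:

```python
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

# Launch the local Phoenix app; the trace UI is served at localhost:6006.
px.launch_app()

# Register an OpenTelemetry tracer provider pointed at the local collector.
tracer_provider = register(project_name="my-llm-app")

# Auto-instrument every OpenAI call; the LangChain and LlamaIndex
# instrumentors follow the same pattern.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, Phoenix"}],
)
```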

On quality scoring, Arize provides both automated evaluators and support for human annotation. You can run LLM-as-judge evaluations, track hallucination rates using their built-in evaluators, and monitor retrieval relevance scores over time. The dashboard can slice these metrics by model version, prompt template, user segment, or any custom dimension you tag. For teams running A/B tests between model versions or prompt variants, the ability to compare quality distributions side by side is extremely valuable.
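
A minimal LLM-as-judge run with Phoenix's built-in hallucination evaluator looks roughly like this. The imports reflect the phoenix.evals API as documented at the time of writing (argument names have shifted between versions), and the dataframe contents are invented for illustration:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP, HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel, llm_classify,
)

# The hallucination template expects input, reference, and output columns.
df = pd.DataFrame({
    "input": ["What is our refund window?"],
    "reference": ["Refunds are accepted within 30 days of purchase."],
    "output": ["You can get a refund within 90 days."],
})

# rails constrains the judge model to the template's allowed labels.
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
)
print(results["label"])  # e.g. "hallucinated" or "factual"
```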

Cost tracking and latency monitoring are solid. Arize tracks per-request token usage, maps it to provider pricing, and aggregates it into dashboards with whatever grouping dimensions you need. Latency is broken down by span type, so you can see whether your bottleneck is in the LLM call, the retrieval step, or the post-processing logic. Views for p50, p95, and p99 latency are standard.

Pricing for the Arize platform starts with a free tier, a Teams tier around $500/month, and Enterprise pricing that varies. Phoenix is free and open source. The commercial platform is more expensive than LangSmith, but the feature set is also broader, especially if you are running traditional ML models alongside LLMs and want a single observability platform for both. Teams that come from a data science background and already think in terms of drift, distributions, and embeddings tend to feel at home immediately.

## WhyLabs: open-source profiling and guardrails

WhyLabs takes a fundamentally different approach. Rather than building around traces or embeddings, WhyLabs builds around statistical profiles. The core open-source library, whylogs, generates lightweight statistical summaries of your data at every stage of the pipeline. These profiles capture distributions, cardinality, missing values, and custom metrics without storing the raw data. For LLM workloads, this means you can monitor prompt length distributions, response token counts, toxicity scores, sentiment drift, and dozens of other signals in near real-time, at a fraction of the storage cost of full trace logging.
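
Here is a minimal sketch of that profiling loop using whylogs together with LangKit, WhyLabs' LLM metrics extension. The prompt and response strings are illustrative; only the statistical profile persists:

```python
import whylogs as why
from langkit import llm_metrics  # WhyLabs' LLM metrics extension for whylogs

# llm_metrics.init() registers prompt/response metrics (length, sentiment,
# toxicity, and so on) on a whylogs schema. The raw text is summarized
# into a profile, never stored.
schema = llm_metrics.init()

result = why.log(
    {
        "prompt": "How do I reset my password?",
        "response": "Go to Settings and click 'Reset password'.",
    },
    schema=schema,
)

# The profile is a compact statistical summary you can inspect locally
# before deciding whether to upload it to the WhyLabs platform.
print(result.view().to_pandas().head())
```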

The profile-based architecture has a philosophical advantage: privacy. Because whylogs summarizes the data rather than storing it, you can monitor LLM interactions in regulated environments where logging full prompts and responses would create compliance headaches. Healthcare organizations, financial services companies, and government agencies find this compelling. You get statistical visibility into what your model is doing without creating a searchable database of every conversation.

WhyLabs' other major differentiator is LLM guardrails, delivered through their Guardrails product. This is not just monitoring after the fact. It is active intervention. You define policies (block responses with PII, reject prompts that attempt jailbreaks, flag outputs that exceed a toxicity threshold) and the guardrails layer evaluates every request and response against those policies in real time. When a policy triggers, the system can block the response, substitute a safe fallback, or log the event for human review. This is the closest any of the three platforms gets to being a safety layer rather than just an observability layer.
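
To make the mental model concrete, here is a conceptual sketch of the guardrail-as-middleware pattern. This is not the WhyLabs Guardrails API; the policy checks, fallback behavior, and names are illustrative stand-ins for what the managed product handles for you:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PolicyResult:
    allowed: bool
    reason: str = ""

PolicyCheck = Callable[[str], PolicyResult]

def check_policies(text: str, checks: list[PolicyCheck]) -> PolicyResult:
    # Return the first violation, or an all-clear if every check passes.
    for check in checks:
        result = check(text)
        if not result.allowed:
            return result
    return PolicyResult(allowed=True)

def guarded_completion(
    prompt: str,
    llm_call: Callable[[str], str],
    input_checks: list[PolicyCheck],
    output_checks: list[PolicyCheck],
    fallback: str = "Sorry, I can't help with that.",
) -> str:
    # Evaluate the prompt against input policies before spending tokens.
    verdict = check_policies(prompt, input_checks)
    if not verdict.allowed:
        return fallback  # block, substitute, or log for human review
    response = llm_call(prompt)
    # Evaluate the response against output policies before returning it.
    verdict = check_policies(response, output_checks)
    return response if verdict.allowed else fallback
```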

The monitoring dashboard aggregates profiles over time and shows you trends in every metric you are tracking. You can detect drift in input distributions (are users suddenly asking about topics your system was not trained for?), output distributions (are responses getting longer or more repetitive?), and quality scores (is the toxicity rate climbing?). Alerts fire when any metric deviates from its baseline by a configurable threshold. The system is designed for high-throughput workloads and can handle millions of inferences per day without the storage overhead of full trace logging.

Where WhyLabs is less strong is in the deep debugging workflow. Because profiles summarize rather than store, you cannot click into a specific failing interaction and see the full prompt, retrieved documents, and model response the way you can in LangSmith or Arize. If you need that level of detail, you need to pair WhyLabs with a separate trace logging tool, or enable selective raw data capture for flagged interactions. WhyLabs knows this and has been expanding its trace capabilities, but it is not the primary strength of the platform.

Pricing for WhyLabs starts with a free tier that covers a generous number of profiles. The paid plans scale based on the number of models and the volume of data profiled, starting around $300/month for teams. The open-source whylogs library is completely free and can be used independently. For teams that are price-sensitive on storage but need high-volume monitoring, the profile-based model can be dramatically cheaper than platforms that store every trace.

## Feature-by-feature comparison

Let me break down the six capabilities that matter most when you are evaluating LLM monitoring platforms, and how each tool handles them.

**Trace visualization.** LangSmith has the most polished trace viewer, especially for LangChain and LangGraph workloads. The nested span tree is intuitive, and the ability to jump from a trace into the playground for debugging is a workflow no other tool replicates as cleanly. Arize Phoenix offers a comparable trace viewer with OpenTelemetry compatibility, which matters if you want vendor-neutral instrumentation. WhyLabs has basic trace support but it is not the core experience. If trace debugging is your primary workflow, LangSmith or Arize should be your shortlist.

**Prompt and response logging.** Both LangSmith and Arize store full prompt/response pairs and make them searchable, filterable, and exportable. You can find every interaction where the model mentioned a competitor, or where the response exceeded 2,000 tokens, or where a specific user triggered an error. WhyLabs profiles the data rather than storing it, so you get statistical summaries but not the raw text. This is a feature for privacy-sensitive environments and a limitation for debugging-heavy workflows. Teams that need both can run WhyLabs for monitoring alongside a lighter logging tool for selective raw capture.

**Cost tracking.** All three platforms can track token usage and map it to provider pricing. LangSmith maintains an updated pricing catalog for major providers and gives you per-trace cost breakdowns. Arize does the same, with the added ability to slice cost by custom dimensions like customer tier or feature flag. WhyLabs tracks cost at the profile level, giving you aggregate cost trends and anomaly detection, but you will not see the cost of an individual request unless you enable raw logging for that interaction. For most teams, all three are adequate here. The differences show up in how you want to slice and drill into the data.

![Close-up of code on a developer monitor showing monitoring implementation](https://images.unsplash.com/photo-1461749280684-dccba630e2f6?w=800&q=80)

**Latency percentiles.** LangSmith and Arize both provide p50, p95, and p99 latency views broken down by span type. You can see whether your retrieval step is the bottleneck or whether the LLM call itself is spiking. Arize adds the ability to correlate latency with other metrics (did latency spike because input prompts got longer, or because the provider had an outage?). WhyLabs captures latency distributions in profiles and can alert on percentile shifts, but the visualization is less granular than the trace-based tools.

**Quality scoring.** This is where the tools diverge most. LangSmith integrates evaluation directly into the product: you can run LLM-as-judge scorers, track feedback scores from users, and compare quality across prompt versions. Arize offers automated evaluation with built-in hallucination and relevance detectors, plus embedding-based quality analysis that catches subtle semantic issues. WhyLabs approaches quality through guardrails and profiling, focusing on aggregate quality trends and policy enforcement rather than per-interaction scoring. For teams that care most about [evaluating LLM quality](/blog/how-to-evaluate-llm-quality) rigorously, LangSmith and Arize have the edge, while WhyLabs excels at enforcing quality boundaries at scale.

**Data retention.** LangSmith retains traces for 14 days on the Plus plan and up to 400 days on Enterprise. Arize retains data based on your plan tier, with Enterprise offering configurable retention windows. WhyLabs retains profiles indefinitely on paid plans, and because profiles are dramatically smaller than raw traces (often 1000x smaller), long-term retention is cheap. If regulatory or audit requirements demand you keep monitoring data for years, WhyLabs' profile-based approach is the most cost-effective path. If you need to query raw interactions from six months ago, LangSmith or Arize on an Enterprise plan is required.

## Architecture and integration patterns

How these tools fit into your existing stack matters as much as their feature sets. A monitoring tool that requires you to rip out your instrumentation layer or adopt a new framework is a non-starter for most teams with production traffic.

**LangSmith integration.** If you are using LangChain or LangGraph, integration is a single environment variable. Set LANGCHAIN_TRACING_V2=true, point the endpoint at LangSmith, and every chain execution is automatically traced. For non-LangChain code, the Python and TypeScript SDKs provide decorators and context managers that wrap your functions in spans. The SDK also supports OpenAI, Anthropic, and other provider clients directly, so you can get tracing without adopting LangChain as your orchestration layer. The trade-off is that LangSmith's SDK is proprietary. You are committing to their instrumentation format, and migrating to another tool later means re-instrumenting your code.
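
As a sketch, assuming the LangSmith environment variable names documented at the time of writing:

```python
import os

# Tracing for LangChain apps is enabled entirely through environment
# variables; no code changes are needed.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "..."        # your LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "prod-chat"  # optional: group traces by project

from langchain_openai import ChatOpenAI

# Every invocation below is now traced to LangSmith automatically.
llm = ChatOpenAI(model="gpt-4o-mini")
print(llm.invoke("Summarize our refund policy in one sentence.").content)
```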

**Arize integration.** Arize supports OpenTelemetry-based instrumentation through its OpenInference specification, which means you can instrument your code with otel-compatible spans and send them to Arize, Jaeger, Grafana Tempo, or any other compatible backend. This is the most portable instrumentation approach of the three. Phoenix provides auto-instrumentors for LangChain, LlamaIndex, OpenAI, and other popular libraries. The Arize platform also accepts data via its Python SDK, REST API, or file upload. For teams that already have an OpenTelemetry setup for their backend services, adding LLM tracing through Arize feels natural.

**WhyLabs integration.** whylogs integrates at the data level rather than the trace level. You call whylogs.log() with your prompt, response, and any metadata, and it generates a statistical profile that gets uploaded to the WhyLabs platform. The library has integrations for LangChain, LlamaIndex, and direct provider SDKs. Guardrails integrate as middleware in your request pipeline, typically as a wrapper around your LLM call that validates inputs and outputs against your policy definitions. The integration is lightweight, but it is a different mental model than trace-based tools. You are thinking about data distributions, not individual requests.
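
A minimal end-to-end sketch, assuming the standard whylogs environment variables (values below are placeholders): profile the interaction, then upload the summary, never the raw text, to the WhyLabs platform:

```python
import os

import whylogs as why

# whylogs reads your WhyLabs org, dataset, and API key from the environment.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-xxxx"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"
os.environ["WHYLABS_API_KEY"] = "..."

results = why.log({
    "prompt": "How do I reset my password?",
    "response": "Go to Settings and click 'Reset password'.",
})
results.writer("whylabs").write()  # uploads the profile, not the raw text
```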

For teams running multiple models or providers, all three tools handle multi-model monitoring, but with different strengths. LangSmith is best at showing you the trace through a complex chain that touches multiple models (say, a classifier followed by a generator followed by a summarizer). Arize is best at comparing the statistical behavior of different models or versions side by side. WhyLabs is best at enforcing consistent policies across all your models regardless of provider.

A pattern I see increasingly is layered monitoring: WhyLabs for guardrails and high-level statistical monitoring, plus LangSmith or Arize for deep trace debugging on flagged interactions. This gives you the cost efficiency and privacy benefits of profile-based monitoring at the top layer, with the ability to drill down into specific interactions when the profiles indicate a problem. The overhead is managing two tools, but the coverage is more complete than any single tool provides. For guidance on structuring this kind of layered approach, our post on [AI observability for production](/blog/ai-observability-for-production) covers the architectural patterns in detail.
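
As a conceptual sketch of that layering (flag_interaction is a hypothetical stand-in for whatever guardrail or anomaly verdict you use):

```python
from typing import Callable

import whylogs as why
from langsmith import traceable

@traceable(name="flagged_interaction")  # full trace, only for flagged cases
def record_flagged(prompt: str, response: str) -> dict:
    return {"prompt": prompt, "response": response}

def monitor(
    prompt: str,
    response: str,
    flag_interaction: Callable[[str, str], bool],  # hypothetical verdict fn
) -> None:
    # Layer 1: a cheap statistical profile for every request.
    why.log({"prompt": prompt, "response": response})
    # Layer 2: full trace capture only when the top layer flags a problem.
    if flag_interaction(prompt, response):
        record_flagged(prompt, response)
```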

## Which tool fits your team

The right choice depends on where your team came from, what your production workload looks like, and which failure modes keep you up at night. Here is the framework I use when advising teams.

**Choose LangSmith if** your orchestration layer is LangChain or LangGraph. The automatic instrumentation, framework-aware trace visualization, and integrated playground create a developer experience that no generic tool can match for that ecosystem. Also choose LangSmith if your primary pain point is prompt iteration speed. The workflow from "find a failing trace" to "edit the prompt" to "run against an eval dataset" to "deploy the fix" is tighter here than anywhere else. Teams of 2 to 15 engineers shipping LLM features iteratively will get the most value.

**Choose Arize if** your team has a data science or ML engineering background and thinks naturally in terms of distributions, drift, and embeddings. Arize is also the right pick if you run traditional ML models alongside LLMs and want a single observability platform for both. The embedding drift detection is genuinely unique and catches categories of degradation that log-based monitoring will miss. Choose Arize if you need OpenTelemetry-native instrumentation and want to avoid vendor lock-in on the SDK layer. Teams of 10 or more with dedicated ML platform engineers will extract the most value.

**Choose WhyLabs if** privacy and compliance are your top constraints. The profile-based architecture lets you monitor without storing sensitive data, which is a hard requirement in healthcare, finance, and government. Also choose WhyLabs if you need active guardrails, not just passive monitoring. The ability to block, filter, or redirect responses in real time based on policy violations is something neither LangSmith nor Arize offers as a first-class feature. Choose WhyLabs if you are running very high volume workloads (millions of inferences per day) where the storage cost of full trace logging would be prohibitive.

**Consider layering tools if** you need both guardrails and deep debugging. WhyLabs for the guardrails and statistical monitoring layer, LangSmith or Arize for the trace debugging layer, gives you comprehensive coverage. The cost of two tools is often less than the cost of one production incident that proper monitoring would have caught.

**Consider alternatives if** none of these fit. Langfuse is open source and self-hostable with excellent tracing. Helicone is a lightweight proxy-based logger that is perfect if you just want cost and latency visibility. Braintrust is eval-first and excels at regression testing in CI. Our comparison of [Braintrust vs Langfuse vs PromptLayer](/blog/braintrust-vs-langfuse-vs-promptlayer) covers the eval-focused end of the spectrum.

One final observation. That 73% of LLM apps lack adequate monitoring is not a technology problem. All three tools covered here can be integrated in a single afternoon. It is a prioritization problem. Teams treat monitoring as something they will add after launch, and then the post-launch feature pressure never lets up. The teams that avoid production incidents are the ones that treat monitoring as a launch requirement, not a post-launch nice-to-have. If you are building an LLM-powered product and want help choosing and integrating the right monitoring stack before your users discover the gaps for you, [book a free strategy call](/get-started) and we will map out the approach together.

---

*Originally published on [Kanopy Labs](https://kanopylabs.com/blog/langsmith-vs-arize-vs-whylabs)*
