---
title: "How to Run LLM Evaluations: A Practical Guide for Founders in 2026"
author: "Nate Laquis"
author_role: "Founder & CEO"
date: "2027-10-19"
category: "AI & Strategy"
tags:
  - LLM evaluation
  - AI testing
  - LLM-as-judge
  - Braintrust
  - Langfuse
  - AI observability
excerpt: "If you ship an LLM feature without evals, you are flying blind. Here is the practical playbook we use to measure, regress, and actually improve production AI systems."
reading_time: "13 min read"
canonical_url: "https://kanopylabs.com/blog/how-to-run-llm-evaluations"
---

# How to Run LLM Evaluations: A Practical Guide for Founders in 2026

## Why Evals Are the Only Moat That Matters

Every founder I talk to in 2026 tells the same story. They ship a slick LLM demo, it wows investors, and then production traffic hits and quality silently collapses. Hallucinations slip in, a model update changes behavior, a prompt tweak fixes one case and breaks five others. Nobody notices until a customer complains on LinkedIn.

The teams that win are not the ones with the fanciest prompts or the most expensive model. They are the ones who treat evaluation as a first class engineering discipline. Evals are your unit tests, your regression suite, your compass, and honestly your only real moat. Anyone can copy your prompt. Nobody can copy the six months of labeled examples and judge rubrics you built from real user behavior.

This guide is the opinionated playbook we use at Kanopy Labs when we help founders ship LLM features they can actually trust. It assumes you already have a working prototype and you are ready to take it seriously.

![Analytics dashboard showing LLM evaluation metrics and quality scores](https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=800&q=80)

## Start With a Golden Dataset, Not a Benchmark

The single biggest mistake I see is teams chasing public benchmarks. MMLU scores do not tell you whether your support bot handles refund requests correctly. You need your own data.

A golden dataset is a curated set of inputs paired with either a reference output, a rubric, or both. For an LLM feature, I want to see at least 100 examples before you ship anything to production, and 500 to 1000 within the first month. Quality beats quantity, but you need enough coverage to catch edge cases.

Here is how we build them in practice:

- **Mine real traffic.** Pull anonymized logs from your prototype or staging. Real user phrasing is always weirder than what your team invents in a brainstorm.

- **Stratify by intent.** Do not just grab 100 random queries. Bucket them by user goal, difficulty, and risk level, then sample across buckets.

- **Include adversarial cases.** Prompt injections, off topic questions, ambiguous inputs, and the classic "ignore previous instructions" attempts belong in the dataset.

- **Label with domain experts, not interns.** If you are building a legal AI, a paralegal should write the reference answers. Cheap labels produce cheap evals.

- **Version it.** Treat the dataset like code. Commit it, tag it, review changes in PRs.

We cover the labeling process in more depth in our guide on [evaluating LLM quality](/blog/how-to-evaluate-llm-quality). The short version is that your dataset is a living asset. Budget for it.
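
To make "treat it like code" concrete, here is a minimal sketch of the entry format we might use: one JSON object per line in a committed JSONL file, loaded by a small Python harness. The field names are illustrative, not a standard.

```python
import json
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One eval case in the versioned golden dataset. Fields are illustrative."""
    id: str                     # stable ID so scores stay comparable across runs
    input: str                  # user-facing input, ideally mined from real traffic
    reference: str | None       # expert-written reference answer, if one exists
    rubric: str | None          # grading rubric, for reference-free cases
    intent: str                 # bucket for stratified sampling, e.g. "refund_request"
    risk: str                   # "low" | "medium" | "high"
    adversarial: bool = False   # prompt injections, "ignore previous instructions", etc.
    tags: list[str] = field(default_factory=list)

def load_dataset(path: str) -> list[GoldenExample]:
    with open(path) as f:
        return [GoldenExample(**json.loads(line)) for line in f]
```

Because the file is plain JSONL under version control, dataset changes show up in PR diffs like any other code change.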

## LLM-as-Judge: Powerful, Biased, and Necessary

For anything beyond exact match tasks, you cannot grade outputs with string comparison. A support reply can be correct in twenty different wordings. Enter LLM-as-judge, the technique where you use a stronger model to score the output of your production model against a rubric.

It works, but only if you respect its failure modes:

- **Position bias.** Judges prefer the first answer in a pairwise comparison. Always randomize order and run both directions.

- **Verbosity bias.** Judges reward longer answers even when shorter is better. Calibrate explicitly in your rubric.

- **Self preference.** GPT judges prefer GPT outputs. Claude judges prefer Claude. Use a different family for the judge than the system under test when possible.

- **Rubric drift.** "Is this answer good" produces noise. Break the rubric into specific dimensions: factual accuracy, tone, completeness, safety, format compliance. Score each 1 to 5 with concrete anchors.

My rule of thumb: validate the judge against human labels on at least 50 examples before trusting it. If the judge agrees with humans less than 80 percent of the time, your rubric is broken, not your model. Fix the rubric first. Only then scale up judging.
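
Here is what that can look like in a hand rolled harness. This is a sketch, not a reference implementation: it assumes an OpenAI-style chat client as the judge, and the model name, rubric anchors, and JSON shapes are all illustrative.

```python
import json
from openai import OpenAI

client = OpenAI()  # judge client; use a different family than the system under test

POINTWISE = """Grade the candidate answer on each dimension from 1 to 5.
Anchors: 1 = unusable, 3 = acceptable with flaws, 5 = expert quality.
Return JSON like {{"factual_accuracy": 4, "tone": 5, "completeness": 3,
"safety": 5, "format_compliance": 4}}.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}"""

def judge(question: str, reference: str, candidate: str) -> dict:
    """Score one output on five rubric dimensions."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative; assumes the generator is a non-OpenAI model
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": POINTWISE.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return json.loads(resp.choices[0].message.content)

PAIRWISE = """Which answer better serves the user? Prefer correctness over length.
Return JSON: {{"winner": "A"}} or {{"winner": "B"}}.

Question: {question}
Answer A: {a}
Answer B: {b}"""

def prefer(question: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o", temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PAIRWISE.format(
            question=question, a=a, b=b)}],
    )
    return json.loads(resp.choices[0].message.content)["winner"]

def compare(question: str, a: str, b: str) -> str:
    """Run both orders to cancel position bias; disagreement counts as a tie."""
    one = prefer(question, a, b)  # a shown in position A
    two = prefer(question, b, a)  # a shown in position B
    if one == "A" and two == "B":
        return "a"
    if one == "B" and two == "A":
        return "b"
    return "tie"
```

Note the verbosity instruction baked into the pairwise prompt: calibrating for bias in the rubric text itself is cheaper than correcting for it afterwards.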

And please, do not use the same model as both the generator and the judge in the same run. That is not an evaluation, that is a model grading its own homework.

## Offline Evals vs Online Evals: You Need Both

Founders tend to do one or the other. Engineering teams love offline evals because they run in CI and produce nice numbers. Product teams love online evals because they reflect real users. Both are right, and you need a clear split between them.

**Offline evals** run against your golden dataset before code ships. They answer: did this change make things better or worse on the scenarios we already care about? They are fast, cheap per run, and deterministic enough to block merges. Every prompt change, model swap, retrieval tweak, or chain refactor should run offline evals automatically.

**Online evals** run against live production traffic. They answer: are real users getting good answers right now? This is where you catch distribution shift, new user behaviors, and the cases your golden dataset missed. Online evals typically sample a percentage of real traffic, run an LLM judge asynchronously, and feed scores into dashboards and alerts.

The handoff between the two is where the magic happens. When online evals flag a failing cluster, you mine those examples, have humans label them, and promote them into the golden dataset. Now your offline suite catches that regression next time. This flywheel is the single most important loop in production AI.
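
A sketch of that promotion step, reusing the JSONL format from the dataset section. The labeling queue interface is hypothetical, and in a real system this runs as an async batch job rather than inline.

```python
import json

SCORE_FLOOR = 3.0  # illustrative threshold on the judge's 1-5 scale

def promote_failures(online_traces: list[dict], labeling_queue, golden_path: str):
    """The flywheel: failing production examples become tomorrow's offline cases."""
    for trace in online_traces:
        scores = trace["judge_scores"]            # per-dimension scores from online judging
        if min(scores.values()) >= SCORE_FLOOR:
            continue                              # passing traffic stays out of the dataset
        labeled = labeling_queue.label(trace)     # hypothetical human-labeling step
        with open(golden_path, "a") as f:         # append to the versioned JSONL dataset
            f.write(json.dumps({
                "id": labeled["id"],
                "input": labeled["input"],
                "reference": labeled["reference"],
                "rubric": None,
                "intent": labeled["intent"],
                "risk": "high",                   # promoted failures default to high risk
                "adversarial": labeled.get("adversarial", False),
            }) + "\n")
```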

![Data pipeline visualization showing offline and online evaluation flows](https://images.unsplash.com/photo-1460925895917-afdab827c52f?w=800&q=80)

## The 2026 Tool Landscape: What to Actually Use

The eval tooling space exploded in the last two years. Here is my honest take on what is worth your time as of 2026.

**Braintrust.** The closest thing to a default choice for serious teams. Strong experiment tracking, good prompt playground, first class support for LLM-as-judge, and a solid SDK. Pricing scales with traces, which adds up, but the developer experience is the best in the category. Use this if you want one tool for prompt engineering, eval runs, and production traces.

**Langfuse.** Open source, self hostable, and excellent for observability plus evals in one stack. If you are privacy sensitive, on premise, or want to avoid vendor lock in, this is my pick. The UI is cleaner than it was a year ago and the dataset and scoring features are now on par with commercial options.

**Promptfoo.** The right tool for CI integration and quick local eval runs. YAML config, runs in GitHub Actions trivially, great for red teaming and prompt regression tests. Pair it with one of the hosted tools above. Promptfoo is the scalpel; Braintrust or Langfuse is the operating room.

**OpenAI Evals.** Still around, still maintained, but honestly it feels dated compared to the newer options. Use it if you are fully on the OpenAI stack and want minimum friction. Otherwise skip.

**ragas.** The specialized choice for RAG pipelines. It gives you concrete metrics like faithfulness, answer relevancy, context precision, and context recall. If you have a retrieval component, you need ragas or something like it. Do not try to grade retrieval quality with a generic LLM judge; the metrics ragas ships are well studied and give you actionable signals per component.
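
For the retrieval layer, a ragas run is only a few lines. Treat this as a sketch rather than gospel: the exact column names and import paths have shifted between ragas versions, so check the docs for whichever release you pin.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)

# One row per eval case: the question, your pipeline's answer, the retrieved
# chunks, and a reference answer (context_recall needs the reference).
data = Dataset.from_dict({
    "question": ["How do I request a refund?"],
    "answer": ["Refunds are processed within 5 business days from the billing page."],
    "contexts": [["Refund policy: customers may request a refund within 30 days..."]],
    "ground_truth": ["Customers request refunds from the billing page within 30 days."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # per-metric scores you can track per pipeline component over time
```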

**Inspect AI, DeepEval, Arize Phoenix.** All legitimate, all have their fans. I would not start here unless you have a specific reason.

My default stack for a seed stage startup in 2026: Langfuse for observability and online evals, Promptfoo in CI, ragas for the RAG layer, and a hand rolled Python harness for custom judge logic. Total cost under 500 dollars a month at moderate volume.

## Regression Testing and CI Integration

If your evals do not block a bad PR from merging, they are decoration. The whole point is to make quality regressions visible before users see them.

Here is the pipeline we set up for every client:

- **Trigger on PRs that touch prompts, model config, or chain code.** Use path filters so you do not run a 50 dollar eval suite on a README change.

- **Run the golden dataset against the changed version.** Compare scores to the main branch baseline.

- **Fail the build on regression thresholds.** For example: block merge if any dimension drops more than 3 percent, or if any high severity case flips from passing to failing.

- **Post a comment on the PR with a diff view.** Show the examples where behavior changed, good and bad. Reviewers should see the actual outputs, not just a number.

- **Cache aggressively.** If a test case has not changed and the model version has not changed, reuse the prior score. This cuts CI cost by 80 percent.

A good eval run on a 500 example dataset should complete in under 10 minutes and cost between 2 and 15 dollars depending on model choice. If yours costs more, you are either over testing or using too expensive a judge model. A cheaper model as the judge for the CI pass, with a more expensive judge sampled on a smaller set, is usually the right move.
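
The gating logic itself is small. Here is a minimal sketch; in CI the score dictionaries come from the eval harness output rather than being hard coded, and the thresholds are the ones from the list above.

```python
import sys

REGRESSION_PCT = 0.03  # block merge if any dimension drops more than 3 percent

def gate(baseline: dict, candidate: dict, high_severity_flips: list[str]) -> int:
    """Compare mean per-dimension scores on the PR branch against main."""
    failures = []
    for dim, base in baseline.items():
        if candidate[dim] < base * (1 - REGRESSION_PCT):
            failures.append(f"{dim}: {base:.2f} -> {candidate[dim]:.2f}")
    failures += [f"high severity flip: {case_id}" for case_id in high_severity_flips]
    if failures:
        print("EVAL GATE FAILED\n" + "\n".join(failures))
        return 1  # nonzero exit code fails the CI job and blocks the merge
    print("eval gate passed")
    return 0

if __name__ == "__main__":
    # Illustrative numbers; completeness drops 7.5 percent, so this run fails.
    main_scores = {"factual_accuracy": 4.2, "tone": 4.6, "completeness": 4.0}
    pr_scores = {"factual_accuracy": 4.3, "tone": 4.5, "completeness": 3.7}
    sys.exit(gate(main_scores, pr_scores, high_severity_flips=[]))
```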

## Observability and the Production Feedback Loop

Offline evals tell you about yesterday. Observability tells you what is happening right now. For LLM systems in production, you need to log every request with enough context to replay it, score it, and debug it later.

At minimum, capture:

- **The full prompt,** including system message, retrieved context, few shot examples, and user input.

- **The full response,** including token counts, finish reason, and latency.

- **A session identifier** so you can reconstruct multi turn conversations.

- **User feedback signals,** whether explicit thumbs up and down or implicit signals like "did the user rephrase the question immediately after."

- **Model version, prompt version, and retrieval config** so you can bucket metrics by release.

Langfuse and Braintrust both handle this well. Whatever you pick, the non negotiable is that a support engineer can paste a user complaint into your observability tool and see the exact trace within seconds. If that workflow takes more than a minute, your production debugging loop is broken.
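
As a checklist, that minimum trace record looks something like the dataclass below. Both Langfuse and Braintrust capture equivalents of these fields through their SDKs; the names here are illustrative, not either tool's schema.

```python
from dataclasses import dataclass

@dataclass
class LLMTrace:
    """Minimal per-request record; field names are illustrative."""
    trace_id: str
    session_id: str               # reconstruct multi turn conversations
    model_version: str            # bucket metrics by release
    prompt_version: str
    retrieval_config: str
    system_prompt: str
    retrieved_context: list[str]  # chunks that went into the prompt
    few_shot_examples: list[str]
    user_input: str
    response: str
    input_tokens: int
    output_tokens: int
    finish_reason: str
    latency_ms: float
    user_feedback: int | None = None  # +1 / -1, or None until the user reacts
```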

The next layer is running judges on sampled production traffic. Start with 5 to 10 percent sampling, score on the same dimensions as your offline rubric, and alert when rolling averages drop. This is how you catch silent regressions from things you do not control, like a provider quietly updating their model weights. It happens more often than providers admit.
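
A minimal sketch of that sampling loop, parameterized on a judge function like the one sketched earlier. In production the scoring runs off the request path (a queue worker), online cases are judged against a reference-free rubric, and the thresholds here are illustrative.

```python
import random
from collections import deque

SAMPLE_RATE = 0.05          # start at 5 percent of production traffic
ALERT_FLOOR = 4.0           # illustrative floor on the rolling mean (1-5 scale)
window = deque(maxlen=500)  # rolling window of recent judge scores

def maybe_score(trace, judge_fn, alert_fn):
    """Judge a sample of live traffic and alert when the rolling average drops."""
    if random.random() >= SAMPLE_RATE:
        return  # unsampled traffic is logged but not judged
    scores = judge_fn(trace.user_input, trace.response)  # per-dimension dict
    window.append(sum(scores.values()) / len(scores))
    rolling = sum(window) / len(window)
    if len(window) == window.maxlen and rolling < ALERT_FLOOR:
        alert_fn(f"online eval rolling mean {rolling:.2f} below {ALERT_FLOOR}")
```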

![Developer monitoring production AI system traces and observability dashboards](https://images.unsplash.com/photo-1555949963-ff9fe0c870eb?w=800&q=80)

## Cost, Common Mistakes, and Where to Start

Let us talk real numbers. For a seed stage startup running a moderate volume LLM product, here is a realistic monthly eval budget in 2026:

- **Tooling:** 100 to 400 dollars for Langfuse cloud or Braintrust starter tier.

- **Judge model inference:** 300 to 1500 dollars depending on sampling rate and judge model choice.

- **Human labeling:** 500 to 2000 dollars a month if you use a service, or internal time if you do it yourselves.

- **Engineering time:** Expect one engineer to spend 15 to 25 percent of their capacity on evals during the first six months. It drops after that.

Total realistic spend: 1000 to 4000 dollars a month to do this right at early stage. If that sounds steep, compare it to the cost of a single quality incident that torches customer trust. Evals are cheap insurance.

The most common mistakes I see:

- **Waiting too long to start.** Build the first 50 example eval before your second prompt iteration, not after your tenth.

- **Grading with a single number.** A composite "quality score" hides everything useful. Break it into dimensions.

- **Using the same model for generation and judging.** Already covered, still rampant.

- **Never updating the rubric.** Your rubric should evolve every month based on what you learn from production.

- **Treating evals as a QA team task.** Evals belong to the engineers shipping the feature. Ownership matters.

If you are picking between investing in evals and investing in a fancier model or a fine tuning run, pick evals every time. Without evals you cannot even tell whether the fancier model helped. We go deeper on this tradeoff in our piece on [fine tuning vs RAG vs prompt engineering](/blog/fine-tuning-vs-rag-vs-prompt-engineering), but the short answer is that evals unlock every other optimization.

Start tomorrow. Build a 50 example dataset by hand this week. Write a rubric with three dimensions. Run Promptfoo locally against your current prompt. You will find bugs in the first hour. That is the point.

If you want help designing an eval harness that actually fits your product and your team, we do this work every day. [Book a free strategy call](/get-started) and we will walk through your system and sketch the eval plan with you.

---

*Originally published on [Kanopy Labs](https://kanopylabs.com/blog/how-to-run-llm-evaluations)*
