What MLOps Actually Means in 2026
MLOps started as "DevOps for machine learning." In 2023, that mostly meant automating Jupyter notebooks and deploying scikit-learn models behind Flask endpoints. The field has shifted dramatically. In 2026, MLOps is really two disciplines merged together: traditional ML operations (data pipelines, feature stores, model training) and LLMOps (prompt management, fine-tuning workflows, inference optimization, evaluation pipelines). If your MLOps stack does not handle both, you are already behind.
The core problem has not changed, though. Getting a model from a data scientist's laptop into production, keeping it healthy, and retraining it when the world shifts is still brutally hard. Google's original "Hidden Technical Debt in Machine Learning Systems" paper from 2015 nailed it: the ML code is the smallest box in the diagram. Everything around it (data collection, feature extraction, serving infrastructure, monitoring) is where the real complexity lives.
What has changed is the tooling. You no longer need to build everything from scratch. Open-source platforms like MLflow, Kubeflow, and ZenML handle orchestration. Managed services like AWS SageMaker, Google Vertex AI, and Azure ML provide end-to-end pipelines. The challenge is choosing the right stack for your team size, budget, and use case, then actually wiring it together so it works reliably.
This guide walks you through building a production MLOps pipeline from zero. Not a toy demo. A real system that handles data versioning, experiment tracking, model packaging, automated evaluation, deployment, monitoring, and cost control. Whether you are shipping traditional ML models or fine-tuned LLMs, the architecture patterns are the same.
Data Versioning and Feature Engineering: The Foundation
Every MLOps pipeline starts with data, and most production ML failures trace back to data problems. If you cannot reproduce a training run from six months ago because the dataset changed, you have a pipeline that looks professional but is fundamentally broken. Data versioning is not optional.
DVC (Data Version Control) is still the most popular open-source option. It works like Git for data: you track pointers (small .dvc files) in your repo, and the actual datasets live in S3, GCS, or Azure Blob Storage. The workflow feels natural to engineers. You run dvc push and dvc pull just like git push and pull. For teams under 20, this is usually the right choice. It costs nothing beyond your cloud storage bill.
LakeFS takes a different approach. It gives you Git-like branching and commits directly on your data lake. You can create a branch of your S3 bucket, experiment with transformations, and merge back. This is powerful for larger teams where multiple data engineers are touching the same datasets. The managed version starts around $500/month.
For LLM-specific workflows, you also need to version your prompts, evaluation datasets, and fine-tuning data. Tools like Humanloop and PromptLayer handle prompt versioning, but honestly, most teams are better off storing prompts as code in their main repo and using DVC for the larger datasets.
Feature stores are worth the investment once you have more than three models sharing features. Feast (open-source) or Tecton (managed) let you define features once, compute them on a schedule, and serve them at low latency during inference. The classic example: you compute "user's average order value over the last 30 days" in your training pipeline. Without a feature store, you recompute it differently in your serving pipeline and get training-serving skew. With a feature store, the same feature definition runs in both places. Tecton pricing starts around $2,000/month. Feast on your own infrastructure costs whatever your compute bill is, typically $200 to $800/month for a small setup.
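The skew problem is easier to see in code. The sketch below is not Feast's or Tecton's API; it only illustrates the principle of one feature definition shared by the training and serving paths. The function name `avg_order_value_30d` and the `(timestamp, amount)` order format are assumptions made for this example.

```python
from datetime import datetime, timedelta


def avg_order_value_30d(orders, as_of):
    """Average order value over the 30 days before `as_of`.

    `orders` is an iterable of (timestamp, amount) pairs (hypothetical format).
    """
    window_start = as_of - timedelta(days=30)
    amounts = [amount for ts, amount in orders if window_start <= ts < as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0


orders = [
    (datetime(2025, 11, 1), 100.0),  # outside the 30-day window
    (datetime(2026, 1, 1), 10.0),
    (datetime(2026, 1, 20), 30.0),
]

# Training path: compute the feature as of the label's timestamp.
training_value = avg_order_value_30d(orders, as_of=datetime(2026, 1, 25))

# Serving path: the same function, called with the request time.
serving_value = avg_order_value_30d(orders, as_of=datetime(2026, 1, 25))
```

Because both paths call the same function with an explicit `as_of` time, there is no second implementation that can silently diverge. A feature store adds scheduling, storage, and low-latency lookup on top of that idea.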
The key principle: every training run should be reproducible from a commit hash. That commit hash should point to exact versions of code, data, config, and dependencies. If you get this right, everything downstream becomes easier.
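One way to make that principle concrete is a small run manifest written next to every training run. This is a minimal sketch using only the standard library; the field names and the `build_manifest` helper are assumptions for illustration, not part of DVC or any other tool.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(code_commit: str, data: bytes, config: dict, lockfile: str) -> dict:
    """Record the four things a training run depends on (hypothetical layout)."""
    manifest = {
        "code_commit": code_commit,
        "data_sha256": sha256_hex(data),
        # sort_keys makes the hash independent of dict insertion order
        "config_sha256": sha256_hex(json.dumps(config, sort_keys=True).encode()),
        "deps_sha256": sha256_hex(lockfile.encode()),
    }
    # A single fingerprint computed over the four fields above
    manifest["run_fingerprint"] = sha256_hex(
        json.dumps(manifest, sort_keys=True).encode()
    )
    return manifest


manifest = build_manifest(
    code_commit="abc123",             # placeholder commit hash
    data=b"user_id,order_value\n1,9.5\n",
    config={"learning_rate": 0.1, "max_depth": 3},
    lockfile="numpy==2.0.0\n",        # placeholder lockfile contents
)
```

If two runs share a `run_fingerprint`, they used the same code, data, config, and dependencies. In practice the data hash would come from your data versioning tool instead of hashing raw bytes in memory.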
Experiment Tracking and Model Registry
Experiment tracking is where most teams start their MLOps journey, and for good reason. Without it, you are comparing model performance by scrolling through terminal output or, worse, asking a colleague "which run had the best F1 score?"
MLflow remains the default choice for experiment tracking. It is open-source, framework-agnostic, and integrates with everything. You log parameters, metrics, and artifacts with a few lines of code. The UI lets you compare runs side by side. And the model registry component lets you tag models as "staging" or "production" and track lineage. Self-hosting MLflow on a small VM costs about $50 to $100/month. Databricks offers a managed version bundled with their platform.
Weights & Biases (W&B) is the premium alternative. The experiment tracking is better (richer visualizations, better collaboration features, sweep/hyperparameter search built in), and their Artifacts feature handles model and dataset versioning cleanly. The free tier covers individuals and small teams. Paid plans start at $50/seat/month. For teams doing heavy experimentation, W&B pays for itself in time saved.
For LLMOps specifically, you need more than traditional experiment tracking. Fine-tuning runs look similar to classical ML training (you are still logging loss curves and evaluation metrics), but prompt engineering workflows need a different kind of tracking. You are iterating on system prompts, few-shot examples, retrieval strategies, and model versions simultaneously. Tools like Braintrust, Langfuse, and LangSmith are purpose-built for this. They log full prompt/completion pairs, let you run automated evaluations against test cases, and track quality over time.
The model registry is the bridge between experimentation and deployment. Think of it as a catalog of trained models with metadata: who trained it, on what data, with what hyperparameters, what its evaluation metrics were, and whether it is approved for production. MLflow's built-in registry works for most teams. If you are on AWS, SageMaker Model Registry integrates tightly with SageMaker endpoints. On GCP, Vertex AI Model Registry does the same.
A practical setup: use W&B or MLflow for experiment tracking, store model artifacts in your cloud provider's blob storage (S3/GCS), and use the model registry to control promotion from dev to staging to production. Every model in the registry should link back to the exact experiment run, dataset version, and code commit that produced it. No exceptions.
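The promotion rule can be sketched in a few lines. This is an in-memory illustration, not the MLflow, SageMaker, or Vertex AI registry API; `ModelVersion`, `promote`, and the stage names are assumptions chosen to mirror the dev, staging, production flow described above.

```python
from dataclasses import dataclass, field

STAGES = ["dev", "staging", "production"]
LINEAGE_FIELDS = ("run_id", "dataset_version", "code_commit")


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str            # experiment run that produced the model
    dataset_version: str   # e.g. a data version tag or hash
    code_commit: str       # Git commit of the training code
    metrics: dict = field(default_factory=dict)
    stage: str = "dev"


def promote(model: ModelVersion) -> ModelVersion:
    """Move a model one stage forward, refusing models without full lineage."""
    missing = [name for name in LINEAGE_FIELDS if not getattr(model, name)]
    if missing:
        raise ValueError(f"cannot promote, missing lineage: {missing}")
    position = STAGES.index(model.stage)
    if position == len(STAGES) - 1:
        raise ValueError("model is already in production")
    model.stage = STAGES[position + 1]
    return model
```

The point of the check is the "no exceptions" rule: a model with an empty `run_id`, `dataset_version`, or `code_commit` never leaves dev.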
CI/CD for Machine Learning: Not the Same as Software CI/CD
Software CI/CD is well understood. Push code, run tests, deploy. ML CI/CD is trickier because you have three things that can change independently: code, data, and model configuration. A code change might not require retraining. A data change almost certainly does. A hyperparameter tweak needs retraining and re-evaluation but no code review. Your CI/CD pipeline needs to handle all three triggers differently.
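A rough sketch of that trigger logic, assuming three change types and hypothetical stage names (`code_review`, `data_validation`, and so on). Real pipelines would derive the change set from the commit or from a data arrival event.

```python
from enum import Enum


class Change(Enum):
    CODE = "code"
    DATA = "data"
    CONFIG = "config"  # hyperparameters and other model configuration


def plan_pipeline(changes, code_touches_training=False):
    """Return the ordered stages to run for a set of changes."""
    stages = []
    if Change.CODE in changes:
        stages += ["code_review", "unit_tests"]
    if Change.DATA in changes:
        stages.append("data_validation")
    # Data and config changes retrain; code changes only retrain
    # when they touch the training path.
    needs_training = (
        Change.DATA in changes
        or Change.CONFIG in changes
        or (Change.CODE in changes and code_touches_training)
    )
    if needs_training:
        stages += ["train", "evaluate"]
    return stages
```

For example, `plan_pipeline({Change.CONFIG})` skips code review and goes straight to training and evaluation, while `plan_pipeline({Change.CODE})` runs review and tests without retraining.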
The pipeline stages you need:
- Data validation. When new data arrives (or on a schedule), validate schema, check for distribution drift, flag anomalies, and run data quality tests. Great Expectations or Pandera work well here. If validation fails, block the pipeline and alert the team.
- Training. Triggered by data changes, code changes, or manual request. Pull versioned data, train the model, log everything to your experiment tracker. This should run on dedicated compute (GPU instances, SageMaker training jobs, or Kubernetes with GPU nodes).
- Evaluation. Automatically run your evaluation suite after training. For classical ML, this means holdout set metrics. For LLMs, this means running your eval dataset through the model and checking quality scores. If metrics regress below your threshold, block promotion.
- Model packaging. Containerize the model with its dependencies. Use Docker with pinned versions of everything. The image should be self-contained: model weights, inference code, preprocessing logic, all baked in.
- Staging deployment. Deploy to a staging environment that mirrors production. Run integration tests, load tests, and shadow traffic if possible.
- Production deployment. Canary or blue-green deployment. Route a small percentage of traffic to the new model, compare metrics against the incumbent, and promote or roll back automatically.
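To illustrate the first stage, here is a minimal data validation sketch in plain Python. It is not the Great Expectations or Pandera API; the schema, the `validate_batch` name, and the 5% null threshold are assumptions. It checks column presence, types, and null rates, and returns a list of problems that the pipeline can use to block the run.

```python
EXPECTED_SCHEMA = {"user_id": int, "order_value": float, "country": str}


def validate_batch(rows, schema=EXPECTED_SCHEMA, max_null_fraction=0.05):
    """Return a list of validation errors for a list of row dicts."""
    errors = []
    null_counts = {column: 0 for column in schema}
    for index, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            errors.append(f"row {index}: missing columns {sorted(missing)}")
            continue
        for column, expected_type in schema.items():
            value = row[column]
            if value is None:
                null_counts[column] += 1
            elif not isinstance(value, expected_type):
                errors.append(
                    f"row {index}: {column} is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    for column, count in null_counts.items():
        if rows and count / len(rows) > max_null_fraction:
            errors.append(f"{column}: null fraction {count / len(rows):.2f} too high")
    return errors
```

An empty list means the batch passes. Anything else should stop the pipeline before training starts and go to the team as an alert.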
Orchestration tools: GitHub Actions can handle simple ML pipelines. For more complex DAGs, use Kubeflow Pipelines (if you are on Kubernetes), AWS Step Functions (if you are all-in on AWS), or Prefect/Dagster for a cloud-agnostic approach. Kubeflow Pipelines is powerful but has a steep learning curve and requires a Kubernetes cluster. If your team is not already running Kubernetes, the operational overhead is significant. Dagster is a lighter-weight alternative that handles ML pipelines well and runs anywhere.
One pattern I see teams skip and then regret: model approval gates. Before any model reaches production, it should pass automated evaluation checks and get a human sign-off. This does not need to be bureaucratic. A Slack notification with the eval report and a thumbs-up reaction is enough. But removing humans from the loop entirely leads to silent regressions that compound for weeks before anyone notices.
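The gate itself is simple logic: automated checks and a human approval, both required. A minimal sketch, with hypothetical metric names and thresholds; how the approval is collected (a Slack reaction, a button in a dashboard) is outside the example.

```python
def failed_checks(metrics, thresholds):
    """Return names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]


def can_promote(metrics, thresholds, human_approved):
    """A model is promoted only if every check passes AND a human signed off."""
    return not failed_checks(metrics, thresholds) and human_approved


thresholds = {"f1": 0.80, "recall": 0.75}       # placeholder minimums
candidate = {"f1": 0.84, "recall": 0.71}        # placeholder eval results

report = failed_checks(candidate, thresholds)   # what goes into the notification
```

Here the candidate would be blocked on recall no matter who approves it, and a candidate that passes every check still waits for the sign-off.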
Deployment Strategies and Serving Infrastructure
You trained your model. It passed evaluation. Now you need to serve it. The serving layer is where many teams underinvest, and it bites them hard when traffic scales or latency requirements tighten.
For traditional ML models (classification, regression, ranking), you have a few options. BentoML and Ray Serve let you wrap models in a serving framework, containerize them, and deploy to any cloud. TorchServe and TensorFlow Serving are framework-specific but highly optimized. For most teams, the simplest path is packaging the model in a FastAPI container and deploying it to your existing Kubernetes cluster or a managed container service like ECS or Cloud Run. Do not overthink this. A well-configured FastAPI app with a few replicas can serve thousands of requests per second for lightweight models; with heavier models, inference time becomes the bottleneck, not the web framework.
For LLM inference, the game is completely different. You are either calling an API (OpenAI, Anthropic, Google) or self-hosting an open model (Llama, Mistral, Qwen). If you are calling an API, your "serving infrastructure" is really a proxy layer that handles rate limiting, caching, fallbacks between providers, and cost tracking. LiteLLM is the most popular proxy. It normalizes the API across providers and gives you a single endpoint. For self-hosting, vLLM is the standard for serving open models with high throughput. It uses PagedAttention for efficient GPU memory management and supports continuous batching. A single A100 costs roughly $2 to $4/hour on major clouds. Keep in mind that a 70B parameter model needs about 140 GB for its weights alone at 16-bit precision, so it does not fit on one 80 GB A100 without quantization. Plan for either a quantized checkpoint or multiple GPUs, and multiply the hourly rate accordingly.
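The proxy idea can be shown without any provider SDK. The sketch below is not LiteLLM's API; `complete_with_fallback`, the provider callables, and the exact-match cache are assumptions for illustration. Each provider is just a function from prompt to response, tried in order.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain raised an error."""


def complete_with_fallback(prompt, providers, cache):
    """Return a cached response, or the first successful provider response.

    `providers` is an ordered list of (name, callable) pairs (hypothetical).
    `cache` is any dict-like object keyed by the exact prompt string.
    """
    if prompt in cache:
        return cache[prompt]
    errors = []
    for name, call in providers:
        try:
            response = call(prompt)
        except Exception as exc:  # timeouts, rate limits, provider outages
            errors.append(f"{name}: {exc}")
            continue
        cache[prompt] = response
        return response
    raise AllProvidersFailed("; ".join(errors))


# Stand-ins for real provider clients:
def flaky_primary(prompt):
    raise TimeoutError("upstream timeout")


def stable_backup(prompt):
    return f"echo: {prompt}"


cache = {}
answer = complete_with_fallback(
    "hello", [("primary", flaky_primary), ("backup", stable_backup)], cache
)
```

A production proxy adds rate limiting, per-request cost tracking, and cache expiry on top of this, but the control flow is the same: cache first, then providers in priority order.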
Deployment patterns that matter:
- Canary deployments. Route 5% of traffic to the new model. Compare latency, error rates, and business metrics against the baseline. If everything looks good after 30 minutes, ramp to 25%, then 50%, then 100%. Argo Rollouts on Kubernetes or AWS CodeDeploy both support this natively.
- Shadow deployments. Run the new model alongside the old one. Send real traffic to both, but only return responses from the old model. Compare the outputs offline. This is the safest option for high-stakes models (fraud detection, medical, financial) but doubles your compute cost.
- A/B testing. Different from canary. Here you are comparing two models against a business metric (conversion rate, engagement, revenue). This requires proper experiment infrastructure: consistent user bucketing, statistical significance calculations, and guardrail metrics.
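Both canary routing and A/B testing depend on sending the same user to the same variant every time. A common way to get that is hash-based bucketing, sketched below. The function names and the experiment key are assumptions for the example.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    # Salting with the experiment name keeps assignments independent
    # across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign_variant(user_id: str, experiment: str, candidate_percent: int) -> str:
    """Route `candidate_percent` percent of users to the new model."""
    if bucket(user_id, experiment) < candidate_percent:
        return "candidate"
    return "baseline"
```

Because the assignment is a pure function of the user ID and experiment name, no lookup table is needed. Raising `candidate_percent` from 5 to 25 also keeps every user who was already on the candidate there, which avoids users flipping between models during a ramp.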
Whichever strategy you pick, you need automatic rollback. If the new model's error rate exceeds a threshold, or latency spikes beyond your SLA, the system should revert to the previous version without human intervention. This is where teams that treat the jump from prototype to production as a simple deployment get burned. Production is not a destination. It is an ongoing operation.
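The ramp-and-rollback decision can be written as a small pure function. This is a sketch of the logic only, not Argo Rollouts or CodeDeploy configuration; the thresholds and metric names are placeholders you would replace with your own SLA.

```python
RAMP_STEPS = [5, 25, 50, 100]  # percent of traffic on the new model


def is_healthy(metrics, max_error_rate=0.01, max_p99_ms=300.0):
    """Check the new model's metrics against placeholder SLA thresholds."""
    return (
        metrics["error_rate"] <= max_error_rate
        and metrics["p99_ms"] <= max_p99_ms
    )


def next_traffic_percent(current, metrics):
    """Return the next traffic share: ramp up if healthy, else roll back to 0."""
    if not is_healthy(metrics):
        return 0  # automatic rollback, no human in the loop
    for step in RAMP_STEPS:
        if step > current:
            return step
    return 100  # already fully rolled out
```

A controller would call `next_traffic_percent` at the end of each observation window (30 minutes in the canary example above) and apply the result to the router.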
Monitoring, Drift Detection, and Automated Evaluation
A model that performs well on launch day will degrade. The only question is how fast. User behavior changes. The world changes. Upstream data sources change. Without monitoring, you will not know your model is broken until a customer complains or revenue dips, and by then the damage is already done.
What to monitor:
- Infrastructure metrics. Latency (p50, p95, p99), throughput, error rates, GPU utilization, memory usage. This is standard APM. Datadog, Grafana, or your cloud provider's monitoring handles it.
- Model performance metrics. Accuracy, precision, recall, F1 on a labeled holdout set. For LLMs: quality scores from automated evals, hallucination rates, refusal rates, tool call success rates. See our guide on AI observability for production for the full breakdown.
- Data drift. Statistical tests comparing the distribution of incoming features against the training distribution. Population Stability Index (PSI) and Kolmogorov-Smirnov tests are the classics. Evidently AI (open-source) generates beautiful drift reports and integrates with most MLOps stacks. NannyML is another strong option focused on performance estimation without ground truth labels.
- Concept drift. This is harder. Data drift means the inputs changed. Concept drift means the relationship between inputs and outputs changed. Your features look the same, but the correct predictions are different. The only way to catch this is continuous evaluation against ground truth labels, which means you need a labeling pipeline running alongside your model.
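For a sense of what a drift check computes, here is a minimal Population Stability Index in plain Python. It is a sketch, not Evidently's or NannyML's implementation: it uses equal-width bins derived from the training sample and a small epsilon to avoid division by zero, both of which are simplifying assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clip out-of-range values
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(count / len(values), eps) for count in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [float(v) for v in range(100)]
live_same = list(training)
live_shifted = [v + 50.0 for v in training]

score_same = psi(training, live_same)        # identical samples give 0.0
score_shifted = psi(training, live_shifted)  # a large shift gives a large score
```

A widely used rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as worth watching, and above 0.25 as significant drift. Treat those cutoffs as a starting point and tune them per feature.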
Automated evaluation loops are the single most impactful thing you can add to your pipeline. For classical ML, schedule weekly eval runs against fresh labeled data. For LLMs, run your evaluation suite (a curated set of test prompts with expected outputs) on every deployment and on a daily schedule. If scores drop below your threshold, trigger an alert. If they drop significantly, trigger automatic rollback.
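A stripped-down version of that loop for an LLM looks like this. The substring match is a deliberately crude scoring rule chosen to keep the example short (real suites often use graders or rubric-based scoring), and the two thresholds are placeholders.

```python
def run_eval_suite(model, cases):
    """Return the fraction of test prompts whose output contains the expected text.

    `model` is any callable from prompt to completion (hypothetical interface).
    `cases` is a list of (prompt, expected_substring) pairs.
    """
    passed = sum(
        1 for prompt, expected in cases if expected.lower() in model(prompt).lower()
    )
    return passed / len(cases)


def decide(score, alert_below=0.90, rollback_below=0.75):
    """Map an eval score to an action using two placeholder thresholds."""
    if score < rollback_below:
        return "rollback"
    if score < alert_below:
        return "alert"
    return "ok"


def fake_model(prompt):
    # Stand-in for a real model call
    return "Paris is the capital of France."


cases = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]
action = decide(run_eval_suite(fake_model, cases))
```

Run the same function from your deployment pipeline and from a daily scheduled job, so a regression caused by an upstream change is caught even when you have not deployed anything.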
Evidently AI deserves a closer look. Their open-source library generates data quality, drift, and model performance reports as HTML dashboards or JSON for programmatic consumption. You can plug it into your CI pipeline to run checks on every data batch. The managed platform (Evidently Cloud) adds alerting and historical tracking. Pricing starts at $500/month for small deployments. For many teams, the open-source version plus a cron job and Slack alerts is enough.
One critical pattern: feedback loops. Collect user feedback on model outputs (thumbs up/down, corrections, implicit signals like "user regenerated the response"). Feed this back into your evaluation and retraining pipelines. Models that learn from production feedback improve over time. Models that do not learn stagnate and eventually fail. Building this feedback infrastructure early, even if it is simple, pays enormous dividends over 6 to 12 months.
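Even a very simple feedback pipeline is useful. The sketch below assumes a hypothetical event format with a `signal` field and treats a regeneration as an implicit negative. User corrections become new evaluation cases.

```python
POSITIVE_SIGNALS = {"thumbs_up"}
NEGATIVE_SIGNALS = {"thumbs_down", "regenerated", "correction"}


def summarize_feedback(events):
    """Turn raw feedback events into a quality signal and new eval cases."""
    rated = [
        e for e in events
        if e["signal"] in POSITIVE_SIGNALS or e["signal"] in NEGATIVE_SIGNALS
    ]
    positive = sum(1 for e in rated if e["signal"] in POSITIVE_SIGNALS)
    # Corrections carry a known-good output, so they can extend the eval set.
    new_eval_cases = [
        (e["prompt"], e["corrected_output"])
        for e in events
        if e["signal"] == "correction"
    ]
    return {
        "satisfaction_rate": positive / len(rated) if rated else None,
        "new_eval_cases": new_eval_cases,
    }


events = [
    {"signal": "thumbs_up", "prompt": "Summarize this ticket"},
    {"signal": "regenerated", "prompt": "Draft a reply"},
    {"signal": "correction", "prompt": "Extract the date",
     "corrected_output": "2026-01-25"},
    {"signal": "thumbs_up", "prompt": "Translate to German"},
]
summary = summarize_feedback(events)
```

Tracking the satisfaction rate per model version gives you a production quality signal, and the growing list of corrected examples feeds both the evaluation suite and future fine-tuning data.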
Cost Management and Putting It All Together
MLOps infrastructure is expensive if you are not paying attention. GPU training costs, inference compute, data storage, experiment tracking SaaS, monitoring tools: it adds up fast. I have seen startups spend $30,000/month on ML infrastructure that could be reduced to $8,000 with better architecture choices.
Where the money goes:
- Training compute. The biggest line item for teams training their own models. A single A100 GPU on AWS costs roughly $3.50/hour on-demand. A training run that takes 8 hours costs $28. If you are running 10 experiments a week, that is $280/week, or roughly $1,100 to $1,200/month on training alone. Use spot instances (60 to 70% cheaper) for fault-tolerant training jobs. Implement checkpointing so interrupted jobs can resume.
- Inference compute. For self-hosted models, this runs 24/7. A single GPU instance for serving costs $2,500 to $5,000/month on-demand. Use autoscaling aggressively. If your traffic drops to near zero at night, scale to zero and cold-start on demand. Serverless GPU options such as Modal and Replicate are excellent for bursty workloads.
- API costs. If you are using hosted LLMs, your costs scale with usage. Track token consumption per user, per feature, per prompt. Set hard budgets and alerts. Implement caching for repeated queries. Semantic caching (caching responses for semantically similar queries) can reduce API costs by 20 to 40%.
- Storage. Model artifacts, training data, logs, and evaluation results. S3 costs about $23/TB/month. Use lifecycle policies to move old data to cheaper tiers (S3 Glacier, $4/TB/month).
- Tooling. W&B ($50/seat), Evidently Cloud ($500/month), feature stores ($2,000/month for Tecton), orchestration platforms. These add up. Audit your tool spend quarterly and cut anything your team is not actively using.
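The arithmetic behind these line items is worth keeping in a small script so you can rerun it as prices change. The rates below are the rough figures quoted in this section, not current list prices, and the function names are made up for the example.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def training_cost_per_month(hours_per_run, runs_per_week, hourly_rate,
                            spot_discount=0.0, weeks_per_month=52 / 12):
    """Monthly cost of recurring training runs on one GPU."""
    cost_per_run = hours_per_run * hourly_rate * (1 - spot_discount)
    return cost_per_run * runs_per_week * weeks_per_month


def always_on_cost_per_month(hourly_rate, instances=1):
    """Monthly cost of instances that run 24/7, such as inference servers."""
    return hourly_rate * HOURS_PER_MONTH * instances


def storage_cost_per_month(terabytes, price_per_tb):
    """Monthly object storage cost at a flat per-terabyte price."""
    return terabytes * price_per_tb


# 8-hour runs, 10 per week, $3.50/hour on-demand
on_demand = training_cost_per_month(8, 10, 3.50)
# The same workload on spot instances at an assumed 65% discount
on_spot = training_cost_per_month(8, 10, 3.50, spot_discount=0.65)
# One always-on GPU for serving at $3.50/hour
serving = always_on_cost_per_month(3.50)
# 10 TB in standard storage at $23/TB versus an archive tier at $4/TB
hot, cold = storage_cost_per_month(10, 23), storage_cost_per_month(10, 4)
```

With these inputs, on-demand training comes to about $1,213/month using 52/12 weeks per month ($1,120 if you count four weeks), and a single always-on GPU at $3.50/hour comes to $2,555/month, which matches the low end of the inference range above.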
A realistic budget for a small team (3 to 5 ML engineers): $3,000 to $8,000/month for a well-optimized stack. $10,000 to $25,000/month if you are training large models or self-hosting LLM inference. These numbers assume you are using spot instances, autoscaling, and not leaving idle resources running.
Putting the full pipeline together, here is the architecture I recommend for most teams starting from scratch. Version your data with DVC. Track experiments with W&B or MLflow. Store models in your cloud provider's registry. Orchestrate pipelines with Dagster or GitHub Actions. Deploy with containers on Kubernetes or a managed service. Monitor with Evidently for drift and your existing APM for infrastructure. Run automated evals on every deployment and on a daily schedule. Collect user feedback and feed it back into retraining.
Start simple. Do not try to build the entire pipeline on day one. Start with experiment tracking and a basic CI/CD pipeline that trains, evaluates, and deploys. Add data versioning when you have more than one person touching the data. Add drift detection when you have been in production for a month and have a baseline. Add automated retraining when you have enough feedback data to make it worthwhile. The teams that succeed with MLOps are the ones that build incrementally and ship value at each step, not the ones that spend three months building the perfect platform before deploying a single model.
Ready to build your production MLOps pipeline?
We help teams design, build, and operationalize ML infrastructure that scales. Whether you are deploying your first model or modernizing an existing pipeline, we can accelerate the process and help you avoid the expensive mistakes. Book a free strategy call and let's map out the right architecture for your team.