The Case for Closing the OpenAI Tab
For most of 2023 and 2024, "use the OpenAI API" was the right answer for almost every founder. Open source models existed but were measurably worse than GPT-4. By late 2024, the gap had narrowed for many tasks. By 2026, Llama 3.3 70B, Llama 3.1 405B, Mistral Large 2, Qwen 2.5, and DeepSeek V3 are all capable enough that "self-host an open model" is a real option for the right use cases.
Self-hosting changes the cost curve dramatically. A workload that costs $80,000 per month on Claude or GPT-4 might cost $12,000 per month on a properly tuned Llama deployment. At $1M+ in annual LLM spend, self-hosting can save more money than your LLM team costs. Below that, the math is harder.
This article is the honest framework for deciding when self-hosting wins, when APIs win, and how to think about the decision without getting seduced by the headline cost savings.
What Self-Hosting Actually Means in 2026
"Self-hosting" is a spectrum. Pick where you sit before you compare to API costs.
- Pay-per-token open model API. Together.ai, Fireworks, Anyscale, Groq, DeepInfra, OpenRouter, Replicate. You call an API but it serves Llama, Mistral, or another open model. No infrastructure to manage. Significantly cheaper than GPT-4 or Claude per token, with the trade-off that you do not own the deployment.
- Managed GPU hosting. AWS Bedrock (for Llama, Mistral, Titan), Google Vertex (for Gemini, plus self-deployed open models), Azure ML, OctoAI, Modal, RunPod. You deploy a model on someone else's GPUs, paying for the GPU time. More control than per-token APIs, and more work, though far less than running your own cluster.
- Self-managed cloud GPUs. Rent GPUs on AWS, GCP, Azure, Lambda Labs, CoreWeave. Deploy your own inference stack (vLLM, Text Generation Inference, SGLang, Triton). Maximum control, maximum work.
- Owned hardware. Buy your own H100s or AMD MI300X cards. Rare for startups. Makes sense at very large scale or for specialized requirements.
For most startups in 2026, the choice is between OpenAI/Anthropic APIs, open model APIs (Together, Fireworks, Groq), and managed GPU hosting (Bedrock, Modal). The full self-managed path is usually overkill until you have an ML team.
The Cost Math That Actually Matters
The temptation is to compare per-token pricing. That number lies. Here is the real cost equation:
For closed APIs (Claude Sonnet, GPT-4o):
- Per-token cost: typically $3 to $15 per million input tokens, $15 to $75 per million output tokens.
- Engineering cost: minimal. Use the SDK, ship in days.
- Infrastructure cost: zero.
- Operational risk: rate limits, vendor changes, no control over latency tail.
For open model APIs (Together, Fireworks, Groq for Llama 3.3 70B):
- Per-token cost: typically $0.50 to $1.20 per million input tokens, $0.50 to $1.20 per million output tokens.
- Engineering cost: small. Most providers expose OpenAI-compatible endpoints, so often only the base URL and model name change (see the sketch after this list).
- Infrastructure cost: zero.
- Operational risk: vendor stability, model availability changes.
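To make the swap concrete, here is a minimal sketch using the OpenAI Python SDK against an OpenAI-compatible open-model endpoint. The base URL, environment variable, and model ID are illustrative; check your provider's docs for the exact values.

```python
# Minimal sketch: pointing the OpenAI SDK at an OpenAI-compatible
# open-model provider. Base URL, env var, and model ID are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",  # provider's OpenAI-compatible endpoint
    api_key=os.environ["TOGETHER_API_KEY"],  # provider key, not an OpenAI key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # provider-specific model ID
    messages=[{"role": "user", "content": "Classify the sentiment: 'Great product!'"}],
)
print(response.choices[0].message.content)
```

The rest of your application code stays untouched, which is what makes this tier so cheap to trial.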
For self-managed GPUs (Llama 3.3 70B on H100s via vLLM):
- GPU cost: H100 instance ~$2 to $4 per hour. You need 2 to 4 for a 70B model with reasonable throughput. Monthly: $3K to $12K per active deployment.
- Engineering cost: 1 to 2 senior ML engineers at $200K to $300K each: $200K to $600K per year.
- Infrastructure cost: monitoring, autoscaling, networking. $1K to $5K per month.
- Operational risk: full ownership. Outages are yours.
Break-even analysis. Self-managed GPU infrastructure beats paying for closed APIs at roughly $50K to $150K per month in API spend. Below that, the engineering overhead is not worth the savings. Above that, savings can be 5 to 10x.
Open model APIs sit in between. They beat closed APIs as soon as your model performance allows the swap, with minimal engineering investment.
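To see where your own workload lands, a back-of-envelope sketch like the one below is usually enough. Every input is a placeholder pulled from the ranges above; substitute your real numbers.

```python
# Back-of-envelope break-even sketch. All inputs are placeholder
# estimates from the ranges above -- plug in your own numbers.

def monthly_self_host_cost(gpu_hourly, gpus, eng_salaries_yearly, infra_monthly):
    gpu = gpu_hourly * gpus * 24 * 30  # GPUs kept warm 24/7
    eng = eng_salaries_yearly / 12     # amortized engineering cost
    return gpu + eng + infra_monthly

closed_api_monthly = 80_000            # current Claude/GPT-4 bill
self_host_monthly = monthly_self_host_cost(
    gpu_hourly=3.0,                    # per-H100 hourly rate
    gpus=4,                            # two 2xH100 replicas for redundancy
    eng_salaries_yearly=450_000,       # ~1.5 senior ML engineers
    infra_monthly=3_000,               # monitoring, networking, autoscaling
)

print(f"self-host: ${self_host_monthly:,.0f}/mo vs API: ${closed_api_monthly:,.0f}/mo")
print(f"monthly savings: ${closed_api_monthly - self_host_monthly:,.0f}")
```

At an $80K/month API bill and roughly $49K/month of self-hosted cost, this toy scenario saves about $31K/month, consistent with the $50K to $150K break-even band.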
When Open Models Are Good Enough
The biggest blocker to self-hosting is not cost. It is quality. If Llama 3.3 70B fails at your task and Claude succeeds, no amount of cost savings justifies the swap.
Here is what open models do well in 2026:
- Classification. Sentiment, intent, topic, category. Llama or Mistral matches GPT-4 quality on most classification tasks.
- Summarization. Especially abstractive summarization of articles, documents, conversations. Open models are competitive.
- Extraction. Structured extraction from unstructured text (entities, fields, key facts). Both open and closed models are strong here.
- Translation. Especially high-resource languages. Llama 3.1 405B is competitive with GPT-4o.
- Boilerplate generation. Code completion, email drafts, marketing copy. Open models handle these well at lower cost.
- RAG-grounded Q&A. When the model is given relevant context, the difference between Claude and Llama narrows significantly.
Here is what closed models still do better:
- Complex reasoning. Multi-step math, formal logic, legal analysis. Claude Sonnet and o1 still lead.
- Long-context tasks. 100K+ token contexts. Claude is currently the best.
- Tool use and function calling. Closed models have more reliable tool calling out of the box.
- Vision. Multimodal understanding is more polished in closed models.
- Code generation for unfamiliar languages and libraries. Closed models still have an edge for niche languages.
- Safety and refusal behavior. Closed models are tuned for content moderation; open models require more guardrails.
The right test is to run your actual task on both types of models and measure with your own evals. Our LLM API pricing guide covers the comparative cost story in more detail.
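If you want a starting point, here is a minimal sketch of that side-by-side test: the same labeled examples run through two OpenAI-compatible endpoints, scored on exact-match accuracy. The endpoints, model IDs, and toy dataset are illustrative; a real eval set should have hundreds of cases.

```python
# Minimal side-by-side eval sketch: run your own labeled examples
# through two OpenAI-compatible endpoints and score accuracy.
# Endpoints, model IDs, and the toy dataset are illustrative.
from openai import OpenAI

cases = [  # replace with your real eval set
    {"prompt": "Sentiment of 'refund took 3 weeks': positive or negative?",
     "expected": "negative"},
]

def accuracy(client, model):
    hits = 0
    for case in cases:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,  # keep the eval as deterministic as possible
        ).choices[0].message.content.strip().lower()
        hits += case["expected"] in out
    return hits / len(cases)

closed = OpenAI()  # defaults to api.openai.com
open_ = OpenAI(base_url="https://api.together.xyz/v1", api_key="...")

print("closed:", accuracy(closed, "gpt-4o"))
print("open:  ", accuracy(open_, "meta-llama/Llama-3.3-70B-Instruct-Turbo"))
```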
Infrastructure Requirements for Self-Hosting
If you decide to self-host, here is what you actually need to run a 70B parameter model in production.
GPU hardware.
- Llama 3.3 70B in FP16. Requires ~140 GB of VRAM. Two H100 80GB GPUs (160 GB total) are the standard config. ~$5 to $8 per hour on AWS or specialty providers. (A rough sizing sketch follows this list.)
- Llama 3.3 70B quantized to INT8. Half the VRAM. Single H100 (80 GB) works. Quality drop is minimal for most tasks. ~$2.50 to $4 per hour.
- Llama 3.3 70B quantized to INT4. Quarter the VRAM. Runs on consumer GPUs like RTX 4090. Quality drop is noticeable for complex tasks but acceptable for many.
- Smaller models (8B, 13B). Single GPU, much cheaper. Mistral 7B or Llama 3.1 8B cost a fraction of 70B and handle simpler tasks well.
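For sizing, a rough rule of thumb is weights = parameters x bytes per parameter, plus headroom for KV cache and activations. The sketch below encodes that rule; the 20% overhead factor is a crude assumption, and real usage varies with context length and batch size.

```python
# Rough VRAM rule of thumb: weights = params x bytes-per-param, plus
# headroom for KV cache and activations. Real usage varies with
# context length, batch size, and engine overhead.

def vram_gb(params_b, bits_per_param, kv_overhead=1.2):
    weights_gb = params_b * bits_per_param / 8  # 70B x 2 bytes = 140 GB at FP16
    return weights_gb * kv_overhead             # ~20% headroom is a crude estimate

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"Llama 70B @ {name}: ~{vram_gb(70, bits):.0f} GB VRAM")
# FP16 ~168 GB -> two H100 80GBs are tight; INT8 ~84 GB -> one H100 is tight
```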
Inference engine.
- vLLM. The most popular open-source inference engine. Supports continuous batching (requests join and leave the batch dynamically instead of waiting for a fixed batch to fill), paged attention, and quantization. Production-ready; see the serving sketch after this list.
- Text Generation Inference (TGI). Hugging Face's inference engine. Mature, well-supported.
- SGLang. Fast and feature-rich; growing in 2026.
- Triton Inference Server. NVIDIA's general-purpose inference server. More work to set up but enterprise-grade.
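As a reference point, here is a minimal vLLM sketch that loads a 70B model across two GPUs with tensor parallelism and generates from a batch of prompts. The model ID and settings are illustrative; for production you would run vLLM's OpenAI-compatible server rather than the offline API shown here.

```python
# Minimal vLLM sketch: load a 70B model across two GPUs with tensor
# parallelism and generate from a batch of prompts. Model ID and
# settings are illustrative -- check the vLLM docs for your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,       # shard weights across 2 H100s
    gpu_memory_utilization=0.90,  # leave headroom for KV cache paging
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize: ..."], params)
for out in outputs:
    print(out.outputs[0].text)
```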
Orchestration.
- Kubernetes with GPU operator for cluster management.
- Autoscaling: KEDA or the Kubernetes horizontal pod autoscaler to scale GPU pods on queue depth or token throughput.
- Load balancing: NGINX, Envoy, or your cloud's L7 LB.
- Monitoring: Prometheus, Grafana, plus token-level metrics from your inference engine (see the metrics sketch after this list).
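For the token-level metrics piece, a small gateway-side exporter built on prometheus_client is often enough to start. The metric names below are made up for illustration; vLLM also exposes its own /metrics endpoint that you should scrape directly.

```python
# Sketch of gateway-level token metrics with prometheus_client. Metric
# names are illustrative; vLLM also exposes its own /metrics endpoint.
from prometheus_client import Counter, Histogram, start_http_server

TOKENS = Counter("llm_tokens_total", "Tokens processed", ["direction", "model"])
LATENCY = Histogram("llm_request_seconds", "End-to-end request latency", ["model"])

def record(model, prompt_tokens, completion_tokens, seconds):
    TOKENS.labels("input", model).inc(prompt_tokens)
    TOKENS.labels("output", model).inc(completion_tokens)
    LATENCY.labels(model).observe(seconds)

start_http_server(9100)  # scrape target for Prometheus
```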
Networking and security.
- Private VPC. GPU instances are expensive; do not expose them to the public internet.
- API gateway for authentication, rate limiting, request logging.
- Audit logging for compliance requirements.
Plan 2 to 6 months for a small ML platform team to set this up cleanly. Skipping any step (no autoscaling, no monitoring, no rate limiting) leads to outages or runaway costs.
Fine-Tuning: The Real Self-Hosting Advantage
The biggest reason to self-host is not cost; it is fine-tuning. Closed APIs offer fine-tuning, but with restrictions: limited base models, no control over hyperparameters, inability to inspect or share weights, and vendor lock-in.
With a self-hosted open model, fine-tuning becomes a real product capability:
- LoRA fine-tuning. Parameter-efficient fine-tuning that adapts a base model to your domain with roughly 1K to 10K training examples. Costs $50 to $500 per training run. Quality improvements on domain-specific tasks can be substantial (see the sketch after this list).
- Full fine-tuning. Updates all model parameters. More expensive ($1K to $20K per run) and requires more compute, but produces stronger improvements for tasks where LoRA falls short.
- Continuous training. Roll updates into your base model as you collect more user feedback. Self-hosted gives you full control over the data and the training cycle.
- Data privacy. Fine-tuning data never leaves your infrastructure. Critical for healthcare, legal, finance.
- Multi-tenant fine-tuning. Train per-customer adapters using LoRA. Each customer gets a model tuned to their data, served from a shared base model. This is the killer pattern for vertical SaaS with AI.
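As a reference for the LoRA path, here is a minimal sketch using Hugging Face's peft library. The base model, hyperparameters, and elided training loop are placeholders; a real run needs your own data pipeline and evaluation.

```python
# Minimal LoRA sketch with Hugging Face peft. Hyperparameters and the
# base model are placeholders; a real run needs your own data pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # small base for illustration
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of base weights
# ...train with transformers.Trainer or trl's SFTTrainer on your examples...
```

Because only the adapter weights train, each per-customer adapter is megabytes, not gigabytes, which is what makes the multi-tenant pattern above economical.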
If your product depends on fine-tuned models for differentiation, self-hosting is the right answer. If you are using off-the-shelf prompting, the case is weaker.
Operational Reality: What Goes Wrong
The cost savings of self-hosting come with operational overhead that founders systematically underestimate. Here is what actually goes wrong.
GPU availability. H100s are still constrained in some regions in 2026. On-demand capacity can sell out, and spot instances can be evicted under you. Plan for capacity reservations and multi-region deployments.
Cold starts. Loading a 70B model into VRAM takes 30 to 90 seconds. Autoscaling is therefore lumpy. You either keep instances warm (paying for idle time) or accept latency spikes during scale-up events.
Latency tail. Self-hosted inference has more variable latency than managed APIs. P99 latency can be 5 to 10x worse than p50 if your batching, queuing, and scheduling are not tuned. This is a real engineering problem.
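Before tuning anything, measure. A crude load test like the sketch below is enough to see whether your p99 is drifting away from your p50; the endpoint URL and payload are placeholders.

```python
# Quick sketch for checking the latency tail: hammer the endpoint and
# compare p50 vs p99. URL and payload are placeholders.
import time
import statistics
import requests

latencies = []
for _ in range(200):
    start = time.perf_counter()
    requests.post(
        "http://llm-gateway.internal/v1/chat/completions",  # your endpoint
        json={"model": "llama-70b",
              "messages": [{"role": "user", "content": "hi"}]},
        timeout=60,
    )
    latencies.append(time.perf_counter() - start)

q = statistics.quantiles(latencies, n=100)  # 99 cut points
print(f"p50: {q[49]:.2f}s  p99: {q[98]:.2f}s")
```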
Model drift. Your fine-tuned model degrades over time as user behavior shifts. You need a continuous evaluation pipeline to detect drift before users complain.
Security. You are now responsible for the security of your inference cluster, your model weights, and your training data. Misconfigured S3 buckets containing customer training data are a real risk.
Cost runaway. An unbounded GPU autoscaler can rack up $10K of unexpected cost in a weekend if an upstream system goes haywire. Monitoring and quotas matter.
Vendor lock-in by another name. Self-hosting on AWS Bedrock or GCP Vertex avoids OpenAI lock-in but creates AWS or GCP lock-in. Be honest about which trade-off you are making.
Our LLM API cost management guide covers patterns for taming both API and self-hosted LLM costs.
My Recommendation by Stage and Use Case
Honest picks for 2026:
- Pre-PMF, exploring AI features: Use closed APIs (OpenAI, Anthropic). Speed matters. Cost does not yet. Iterate fast.
- Post-PMF, simple use cases: Use open model APIs (Together, Fireworks, Groq for Llama or Mistral). Cheaper than closed APIs. Almost no engineering overhead. Easy to swap if quality matters.
- Post-PMF, complex reasoning: Use closed APIs (Claude, GPT-4o). Quality is still the moat for complex use cases. Pay the premium until your task gets simpler or open models catch up.
- Established product, $50K+/month LLM spend: Evaluate self-hosting for your highest-volume tasks. Use OpenRouter or similar to compare quality across models on your real workload. Move what works to self-hosting.
- Established product, $200K+/month LLM spend: Self-host wherever quality permits. Hire 1 to 2 ML engineers. The savings will more than cover them.
- Healthcare, legal, finance: Self-host for any patient or client data, regardless of cost. Privacy and compliance trump everything. Use closed APIs only with explicit legal review.
- Multi-tenant SaaS with per-customer fine-tuning: Self-host. LoRA adapters per customer is the killer architecture and only works on infrastructure you control.
- Vertical AI products with proprietary data: Self-host with fine-tuning. Your data is your moat. Closed APIs let you use it but cannot turn it into a model differentiator.
- Embedded or edge AI: Self-host with quantized models. On-device or VPC inference is the only option for many edge use cases.
The single biggest mistake founders make in this decision is self-hosting too early because the per-token math looks good. The engineering and operational cost of self-hosting is real, and it dominates the equation until your spend is large enough to amortize it. If you are spending under $20K/month on closed APIs, the right move is almost always to keep using closed APIs and focus on building features.
If you want help running the math for your specific workload or evaluating whether self-hosting will actually save you money, book a free strategy call. I can help you avoid spending six months building infrastructure that does not pay back.