OpenAI Realtime API vs LiveKit Agents vs ElevenLabs: Voice AI

Three very different approaches to building voice AI. OpenAI went speech-to-speech. LiveKit went open source orchestration. ElevenLabs doubled down on voice quality. Here is how they actually compare in production.

Nate Laquis

Founder & CEO

The Voice AI Architecture Shift

For the first wave of voice AI products, everyone built the same way: chain together a speech-to-text engine, an LLM, and a text-to-speech engine. Deepgram transcribes the caller's words. GPT-4 or Claude generates a response. ElevenLabs or PlayHT synthesizes audio. Three API calls, three network hops, three vendors. It worked, but even with aggressive streaming, response latency bottomed out somewhere between 600ms and 1.2 seconds.

That pipeline approach still dominates production deployments in 2026, and for good reason. It gives you full control. You can swap any component, mix the best STT with the best LLM with the best TTS, and optimize each piece independently. Platforms like Vapi, Retell, and Bland AI all use variations of this pattern under the hood.

But three new approaches have emerged that challenge the pipeline model in different ways. OpenAI's Realtime API bypasses the pipeline entirely with speech-to-speech inference. LiveKit Agents provides an open source orchestration layer that makes the pipeline faster and more flexible. ElevenLabs Conversational AI combines their industry-leading TTS with a managed agent runtime. Each makes fundamentally different tradeoffs around latency, cost, control, and voice quality.

Choosing between them is not just a technical decision. It determines your cost structure at scale, your ability to customize, your vendor lock-in risk, and whether you can meet the latency requirements your users actually care about. We have deployed all three in production for clients, and the right answer depends entirely on the use case.

OpenAI Realtime API: Speech-to-Speech Without the Pipeline

OpenAI's Realtime API is the most radical departure from the traditional voice pipeline. Instead of converting speech to text, processing text through an LLM, and converting the response back to speech, the Realtime API processes audio natively. You send raw audio in over a WebSocket connection and receive generated audio back. The model "thinks" in audio, which eliminates two entire steps from the pipeline.

The protocol is WebSocket-based with a persistent, bidirectional connection. You stream PCM16 audio frames to the server, and the model streams audio frames back. The API supports both server-side and client-side voice activity detection (VAD). Server-side VAD is simpler to implement: you just keep sending audio and the model decides when the user has stopped speaking. Client-side VAD gives you more control but requires running Silero or a similar model locally.
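To make the message flow concrete, here is a minimal sketch of the two JSON events a client would send: one that configures the session for PCM16 audio with server-side VAD, and one that appends a chunk of audio. The event and field names (`session.update`, `input_audio_buffer.append`, `turn_detection`) follow our reading of OpenAI's Realtime API reference and have changed between API versions, so treat them as assumptions to verify. The helper function names are ours.

```python
import base64
import json
import struct


def session_update_event(use_server_vad: bool = True) -> str:
    """Build a session configuration event (field names are assumptions)."""
    session = {
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        # With server VAD the model decides when the user has stopped speaking.
        # With None, the client has to signal end of turn itself.
        "turn_detection": {"type": "server_vad"} if use_server_vad else None,
    }
    return json.dumps({"type": "session.update", "session": session})


def audio_append_event(samples: list[int]) -> str:
    """Pack signed 16-bit samples as little-endian PCM16, then base64-encode."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })
```

In a real client these strings would be written to the open WebSocket. The sketch leaves the connection itself out so that it runs with the standard library alone.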

Function Calling in Real Time

The Realtime API supports function calling, which is what makes it viable for production voice agents rather than just demos. You define tools in the session configuration, and the model will invoke them mid-conversation. When the model calls a function, the audio stream pauses, you execute the function server-side, return the result, and the model resumes speaking with the function output incorporated into its response. Each function call adds 200 to 500ms of latency, depending on how long your function takes to execute.
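The round trip can be sketched as a small handler: take the model's completed function call, run the matching tool, and build the two events that hand the result back and ask the model to continue. The event shapes (`conversation.item.create` with a `function_call_output` item, followed by `response.create`) reflect our understanding of the Realtime API and should be checked against the current reference. The `lookup_order` tool and the `TOOLS` registry are hypothetical.

```python
import json
from typing import Callable

# Hypothetical tool registry: tool name -> Python function.
TOOLS: dict[str, Callable[..., dict]] = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
}


def handle_function_call(event: dict) -> list[dict]:
    """Turn a completed function-call event into the events to send back."""
    args = json.loads(event["arguments"])   # arguments arrive as a JSON string
    result = TOOLS[event["name"]](**args)   # run the tool server-side
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],  # ties the result to the call
                "output": json.dumps(result),
            },
        },
        # Ask the model to resume speaking with the tool result in context.
        {"type": "response.create"},
    ]
```

The time spent inside the tool function is the part of the added latency you control, so slow lookups are worth caching or prefetching.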

Latency and Quality

Raw response latency is the Realtime API's biggest advantage. Because there is no STT or TTS step, the time from end of user speech to first audio byte of the response is typically 300 to 500ms. That is faster than any pipeline-based system we have tested. Conversations feel genuinely natural at this speed.

The catch is voice quality. The speech-to-speech model generates audio that sounds good but not great. It lacks the nuance and naturalness of dedicated TTS engines like ElevenLabs Turbo v3. Voices can sound slightly metallic or flat during longer utterances. You also get a limited selection of preset voices with no custom voice cloning.

Pricing

OpenAI charges for Realtime API usage based on audio duration and tokens. As of mid-2026, pricing for the gpt-4o-realtime model breaks down to roughly $0.06 to $0.10 per minute of conversation, depending on how much the model speaks versus listens. That is more expensive per minute than assembling your own pipeline ($0.04 to $0.07), but you are paying for simplicity and low latency. At 10,000 minutes per month, expect a bill of $600 to $1,000 from OpenAI alone, before telephony costs.
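The arithmetic behind those monthly figures is simple enough to check in a few lines. This sketch uses only the per-minute ranges quoted above. The function name is ours, and telephony is excluded, as in the text.

```python
def monthly_cost(minutes: int, rate_low: float, rate_high: float) -> tuple[float, float]:
    """Return the (low, high) monthly bill in dollars, rounded to cents."""
    return round(minutes * rate_low, 2), round(minutes * rate_high, 2)


realtime = monthly_cost(10_000, 0.06, 0.10)  # Realtime API range quoted above
pipeline = monthly_cost(10_000, 0.04, 0.07)  # self-assembled pipeline range
print(realtime, pipeline)
```

Plugging in your own call volume makes it easy to see where the gap between the two approaches starts to matter.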

One important limitation: the Realtime API only supports OpenAI models. You cannot use Claude, Gemini, or open source LLMs. If OpenAI raises prices or degrades quality, your only option is to rebuild on a different architecture entirely.

LiveKit Agents: Open Source Orchestration for Voice AI

LiveKit takes the opposite approach from OpenAI. Instead of replacing the pipeline, LiveKit Agents makes the pipeline better. It is an open source Python framework for building real-time AI agents that can see, hear, and speak. The framework handles all the hard parts of real-time media: WebRTC connections, audio routing, echo cancellation, noise suppression, and synchronization between components.

The core concept is a "room." A LiveKit room is a real-time communication session, similar to a video call. Your AI agent joins the room as a participant alongside the human user. Audio flows between participants through LiveKit's media server with sub-100ms transport latency. The agent subscribes to the user's audio track, processes it through a configurable pipeline, and publishes its response as a new audio track.

Plugin Architecture

LiveKit Agents uses a plugin system that lets you swap any component in the voice pipeline. The framework provides official plugins for Deepgram STT, OpenAI STT and TTS, ElevenLabs TTS, Cartesia TTS, Google Cloud STT, Anthropic Claude, Azure Cognitive Services, and more. You can also write custom plugins. A typical agent configuration looks like: Deepgram for STT, Claude Haiku for the LLM, and ElevenLabs Turbo v3 for TTS. But you could swap Claude for GPT-4o, or Deepgram for AssemblyAI, or ElevenLabs for Cartesia, all with a few lines of configuration.

This modularity is LiveKit's biggest advantage. You are not locked into any single AI vendor. When a new STT engine launches with better accuracy, you swap the plugin. When a TTS provider drops their prices, you switch. Your agent code stays the same.
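The idea behind that swappability can be illustrated without any vendor SDK. The sketch below is not LiveKit's actual plugin API. It is a simplified model with hypothetical interface and class names, showing why agent code that depends only on narrow STT, LLM, and TTS interfaces does not change when one component is replaced.

```python
from dataclasses import dataclass
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        # Agent logic depends only on the three interfaces above.
        return self.tts.synthesize(self.llm.reply(self.stt.transcribe(audio)))


# Stand-in components; real ones would wrap vendor SDKs.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class TaggedTTS:
    def synthesize(self, text: str) -> bytes:
        return b"[alt]" + text.encode("utf-8")


pipeline = VoicePipeline(EchoSTT(), UpperLLM(), BytesTTS())
swapped = VoicePipeline(EchoSTT(), UpperLLM(), TaggedTTS())  # only the TTS changed
```

Swapping the TTS is a one-argument change, and `respond` stays untouched. That is the property the plugin system gives you at production scale.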

Telephony Integration

LiveKit has first-class SIP support. Their SIP bridge connects LiveKit rooms to the PSTN (Public Switched Telephone Network), meaning your AI agent can receive and make actual phone calls. You configure SIP trunks through Twilio, Telnyx, or Vonage, and LiveKit handles the media bridging. Inbound calls get routed to a LiveKit room where your agent is waiting. Outbound calls work the same way in reverse.

Pricing

LiveKit's server infrastructure (LiveKit Cloud) charges based on participant-minutes. For voice-only agents, the cost is roughly $0.004 per participant-minute, which means $0.008 per minute of conversation (two participants: the agent and the human). That is just the infrastructure cost. You still pay separately for STT, LLM, and TTS from your chosen providers. Total pipeline cost with LiveKit Cloud: $0.04 to $0.09 per minute depending on your component choices.
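A short sketch of how the per-minute total is assembled. The $0.004 participant-minute rate and the two-participant assumption come from the paragraph above. The STT, LLM, and TTS rates are placeholders we picked for illustration, not quotes from any provider.

```python
def conversation_cost_per_minute(
    infra_rate: float,   # dollars per participant-minute
    participants: int,
    stt: float = 0.0,    # dollars per conversation minute
    llm: float = 0.0,
    tts: float = 0.0,
) -> float:
    """Infrastructure cost scales with participants; AI components are added on top."""
    infra = infra_rate * participants
    return round(infra + stt + llm + tts, 4)


infra_only = conversation_cost_per_minute(0.004, 2)
# Placeholder component rates, chosen only to show the calculation.
total = conversation_cost_per_minute(0.004, 2, stt=0.01, llm=0.01, tts=0.02)
print(infra_only, total)
```

With these placeholder rates the infrastructure share is a small fraction of the total, which is why component choice drives most of the cost range.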

If you self-host the LiveKit server, the infrastructure cost drops to whatever your compute costs are. The entire stack is open source under the Apache 2.0 license. A single modest VM can handle 50 to 100 concurrent voice agent sessions. For teams with DevOps capability, self-hosting LiveKit is one of the most cost-effective ways to run voice AI at scale.

ElevenLabs Conversational AI: Voice Quality as the Differentiator

ElevenLabs built their reputation on having the best text-to-speech in the industry, and their Conversational AI product leverages that advantage. Where OpenAI offers speech-to-speech and LiveKit offers pipeline orchestration, ElevenLabs offers a managed agent runtime with unmatched voice quality and the most advanced voice cloning capabilities available.

The Conversational AI product is a full agent builder. You define your agent's persona, system prompt, knowledge base, and tools through their dashboard or API. ElevenLabs handles the STT (they use Deepgram or their own model under the hood), routes to an LLM of your choice (GPT-4o, Claude, Gemini, or custom endpoints), and synthesizes the response with their own TTS. The key differentiator is that last step: nobody else produces voice output this natural.

Voice Cloning and Customization

ElevenLabs offers three tiers of voice cloning. Instant Voice Cloning requires just 30 seconds of sample audio and produces a usable clone in minutes. Professional Voice Cloning uses 30 to 60 minutes of studio-quality recordings and generates a near-perfect replica. The third tier, custom fine-tuned models, is available on enterprise plans for brands that need the closest possible voice reproduction. We have used Professional Voice Cloning for a healthcare client's patient outreach agent, and callers genuinely could not tell they were speaking with an AI.

Multi-Language Support

ElevenLabs supports 32 languages with high-quality voice synthesis, and their models handle code-switching (mixing languages mid-sentence) better than any competitor. For businesses operating globally, this is a significant advantage. A single agent can greet callers in English, switch to Spanish when it detects the caller's preference, and maintain the same voice identity across languages. OpenAI Realtime supports about 15 languages. LiveKit's language support depends on which STT and TTS plugins you choose.

Latency

ElevenLabs optimized their Conversational AI for low latency. Their Turbo v3 TTS model delivers time-to-first-byte under 150ms. Combined with streaming STT and fast LLM inference, total pipeline latency lands between 500ms and 800ms in production. That is slower than OpenAI's Realtime API but competitive with a well-optimized LiveKit pipeline. ElevenLabs also offers a "low latency" mode that trades minor voice quality for faster response, bringing total latency under 600ms.
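A latency budget is a useful way to sanity-check numbers like these. In the sketch below, only the 150ms TTS figure and the 500 to 800ms target range come from the text. The STT and LLM values are assumed placeholders, and the stages are treated as strictly sequential, which is a simplification, since streaming lets them overlap.

```python
def total_latency_ms(stages: dict[str, int]) -> int:
    """Sum per-stage latencies for a pipeline whose stages run in sequence."""
    return sum(stages.values())


budget = {
    "stt_finalization": 200,  # assumed placeholder
    "llm_first_token": 300,   # assumed placeholder
    "tts_first_byte": 150,    # upper bound quoted for the TTS model
}

total = total_latency_ms(budget)
within_range = 500 <= total <= 800
print(total, within_range)
```

Measuring each stage separately in your own deployment shows which component to tune first when the total drifts out of range.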

Pricing

ElevenLabs Conversational AI pricing is conversation-minute-based. The Growth plan charges roughly $0.08 to $0.12 per minute of conversation, which includes STT, LLM routing, and TTS. Enterprise plans with custom voices and higher concurrency come with negotiated pricing that can drop to $0.05 to $0.08 per minute at volume. Compared to assembling your own pipeline, the premium is 20 to 40%, but you save significant engineering time and get voice quality that is hard to match any other way.

Head-to-Head Comparison: Latency, Quality, Cost, and Scale

We have benchmarked all three platforms in production environments with real telephony traffic. Here is how they compare across the metrics that actually matter.

Latency (End of Speech to First Response Audio)

  • OpenAI Realtime API: 300 to 500ms. Best in class because there is no pipeline. The model processes audio natively.
  • LiveKit Agents (optimized pipeline): 500 to 800ms. Depends heavily on component choices. Deepgram STT + Claude Haiku + Cartesia TTS hits the low end. Switching to GPT-4o + ElevenLabs pushes toward the high end.
  • ElevenLabs Conversational AI: 500 to 800ms in standard mode, 400 to 600ms in low-latency mode. Consistent but slightly less tunable than LiveKit.

Voice Quality

  • OpenAI Realtime API: Good but not exceptional. Voices sound clear and intelligible. Lacks the warmth and expressiveness of dedicated TTS engines. No custom voice cloning.
  • LiveKit Agents: Depends entirely on your TTS plugin. With ElevenLabs Turbo v3, voice quality is outstanding. With Cartesia Sonic, it is fast but slightly less natural. You choose your tradeoff.
  • ElevenLabs Conversational AI: Best in class. Period. Their voices are the most natural-sounding in the industry, and custom voice cloning adds another level of brand differentiation.

Cost Per Minute at Scale (10,000+ minutes/month)

  • OpenAI Realtime API: $0.06 to $0.10/min (all-in except telephony). Simple pricing but no way to reduce it by swapping components.
  • LiveKit Agents (Cloud): $0.04 to $0.09/min (infrastructure + STT + LLM + TTS). Most flexible. Self-hosting drops this further.
  • LiveKit Agents (self-hosted): $0.03 to $0.07/min. Best cost efficiency if you have the DevOps team to manage it.
  • ElevenLabs Conversational AI: $0.05 to $0.12/min depending on plan. Premium pricing for premium voice quality.

Language Support

  • OpenAI Realtime: ~15 languages with varying quality.
  • LiveKit Agents: Depends on plugins. Deepgram supports 36 languages for STT. TTS language support varies by provider.
  • ElevenLabs: 32 languages with excellent quality across all of them. Best code-switching support.

Customization and Control

  • OpenAI Realtime: Limited. You get their model, their voices, their protocol. Function calling is flexible, but infrastructure decisions are made for you.
  • LiveKit Agents: Maximum control. Open source, self-hostable, every component is swappable. You own the entire stack.
  • ElevenLabs: Moderate. You can choose your LLM, clone voices, and customize the agent, but the runtime is managed.
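The latency and cost figures above can be encoded as data and filtered against your own limits. This is a sketch with hypothetical class and function names. The ElevenLabs latency range merges its standard and low-latency modes into one span, and the filter checks worst-case figures only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    latency_ms: tuple[int, int]        # (best, worst) end of speech to first audio
    cost_per_min: tuple[float, float]  # (low, high) dollars, excluding telephony


PLATFORMS = [
    Platform("OpenAI Realtime API", (300, 500), (0.06, 0.10)),
    Platform("LiveKit Agents (Cloud)", (500, 800), (0.04, 0.09)),
    Platform("LiveKit Agents (self-hosted)", (500, 800), (0.03, 0.07)),
    Platform("ElevenLabs Conversational AI", (400, 800), (0.05, 0.12)),
]


def shortlist(max_latency_ms: int, max_cost: float) -> list[str]:
    """Keep platforms whose worst-case figures fit both limits."""
    return [
        p.name
        for p in PLATFORMS
        if p.latency_ms[1] <= max_latency_ms and p.cost_per_min[1] <= max_cost
    ]


print(shortlist(500, 0.10))
print(shortlist(800, 0.09))
```

Voice quality and control do not reduce to a number as cleanly, so treat a filter like this as a first pass rather than a decision.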

Telephony Integration: SIP, PSTN, and Twilio

If your voice AI agent needs to make or receive actual phone calls, telephony integration is a hard requirement. This is where the three platforms diverge significantly.

OpenAI Realtime API + Telephony

The Realtime API has no built-in telephony support. You need to build a media bridge that converts between the SIP/RTP audio format used by telephony providers and the WebSocket-based protocol the Realtime API expects. Twilio's Media Streams works for this: it sends real-time audio from a phone call to your WebSocket server, which then forwards it to OpenAI. The return audio goes back the same way. Expect 2 to 4 weeks of engineering work to get this right, including handling call events, DTMF tones, call transfers, and recording. Several open source bridges exist on GitHub that can accelerate this, but they all need production hardening.
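The core of such a bridge is message translation in both directions. The sketch below shows only that piece, under several assumptions: Twilio's `media` message layout as we understand it from the Media Streams documentation, the Realtime event names `input_audio_buffer.append` and `response.audio.delta` (which may differ by API version), and a session configured so both sides exchange the same audio encoding, which lets the base64 payload pass through unchanged. Call events, DTMF, transfers, and recording are left out.

```python
import json
from typing import Optional


def twilio_to_realtime(message: str) -> Optional[str]:
    """Translate one Twilio Media Streams message into a Realtime API event."""
    msg = json.loads(message)
    if msg.get("event") != "media":
        return None  # start/stop/mark/dtmf messages need separate handling
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": msg["media"]["payload"],  # base64 audio, forwarded unchanged
    })


def realtime_to_twilio(event: str, stream_sid: str) -> Optional[str]:
    """Translate one Realtime audio event into a Twilio media message."""
    evt = json.loads(event)
    if evt.get("type") != "response.audio.delta":
        return None  # non-audio events are handled elsewhere
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": evt["delta"]},
    })
```

Most of the multi-week effort goes into everything around these two functions: reconnects, interruption handling, and the call lifecycle.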

LiveKit Agents + Telephony

LiveKit has the strongest telephony story. Their SIP bridge is a first-party component that connects SIP trunks directly to LiveKit rooms. Configuration is straightforward: you set up a SIP trunk with Twilio, Telnyx, or any SIP provider, point it at your LiveKit server, and define dispatch rules that route incoming calls to your agent. Outbound calls work through the same SIP trunk. The SIP bridge handles codec negotiation, DTMF, call transfer (via SIP REFER), and hold/resume. For teams building call center replacements or high-volume outbound dialers, LiveKit's telephony integration is the most production-ready option.

ElevenLabs + Telephony

ElevenLabs offers built-in Twilio integration for their Conversational AI agents. You connect your Twilio account, assign a phone number, and the agent handles calls directly. Setup takes about 30 minutes if you already have a Twilio account. The integration handles inbound and outbound calls, call recording, and basic call transfer. For more complex telephony requirements (IVR trees, queue management, multi-agent routing), you will need to build additional logic on the Twilio side.

For pure web-based voice interactions (in-browser or in-app), all three platforms work well. OpenAI's Realtime API supports WebRTC for browser clients alongside the WebSocket interface described earlier. LiveKit uses WebRTC. ElevenLabs offers both WebRTC and WebSocket options. Web-based deployment avoids telephony costs entirely, which can save $0.01 to $0.03 per minute. If your use case is a voice-enabled chatbot on your website or mobile app, telephony complexity becomes irrelevant. As covered in our guide on streaming AI response patterns, choosing the right transport protocol matters more than the telephony layer for these deployments.

Recommendations by Use Case

After deploying all three platforms across different client projects, here are our opinionated recommendations for specific use cases.

Customer Support Call Centers

Use LiveKit Agents with self-hosted infrastructure. Customer support at scale is all about cost efficiency and reliability. LiveKit gives you the lowest per-minute cost when self-hosted, full control over the pipeline, and rock-solid telephony integration. Pair it with Deepgram Nova-3 for STT, Claude Haiku for fast inference on routine queries (with automatic escalation to Claude Sonnet for complex issues), and Cartesia Sonic for low-latency TTS. Total cost: $0.03 to $0.05 per minute at 50,000+ minutes per month. We walk through the full architecture in our guide to building AI voice agents.
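The escalation step can start as a very simple rule. The sketch below is one hypothetical policy, not the logic from any specific deployment: the keyword list, the failed-turn threshold, and the model labels are all placeholders you would replace with your own signals and real model IDs.

```python
# Hypothetical escalation triggers; tune these to your own support data.
ESCALATION_KEYWORDS = {"refund", "cancel", "legal", "complaint"}


def pick_model(transcript: str, failed_turns: int) -> str:
    """Route routine turns to the fast model and harder ones to the larger model."""
    words = set(transcript.lower().split())
    if failed_turns >= 2 or words & ESCALATION_KEYWORDS:
        return "large-model"  # placeholder for a Claude Sonnet model ID
    return "fast-model"       # placeholder for a Claude Haiku model ID


print(pick_model("where is my order", 0))
print(pick_model("I want a refund", 0))
```

Because the split is on whitespace, punctuation attached to a keyword would hide it. A production router would normalize the transcript first or use a small classifier instead.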

Sales and Lead Qualification

Use ElevenLabs Conversational AI. Sales calls demand voice quality above everything else. A robotic-sounding agent kills conversion rates. ElevenLabs' voice cloning lets you create a consistent brand voice, and their multi-language support opens international markets without building separate agents. The higher per-minute cost ($0.08 to $0.12) is justified when each qualified lead is worth hundreds or thousands of dollars. Pair with GPT-4o for its strong instruction following on sales scripts.

Healthcare and Patient Communication

Use LiveKit Agents with a HIPAA-compliant hosting setup. Healthcare requires data sovereignty, audit logging, and strict compliance. LiveKit's self-hosted model means patient audio never leaves your infrastructure. You control encryption, retention, and access. Use a HIPAA-eligible LLM endpoint (Azure OpenAI or Anthropic's API with a BAA in place). For appointment reminders and prescription refill calls, this setup provides compliance without sacrificing conversational quality.

Gaming and Interactive Entertainment

Use OpenAI Realtime API. Games and entertainment applications prioritize latency and immersion over cost. The Realtime API's 300 to 500ms response time creates genuinely conversational NPCs and interactive characters. The speech-to-speech model also picks up on tone and emotion in ways that text-intermediated pipelines cannot, adding a layer of expressiveness that matters for entertainment. At gaming scale (millions of short interactions), the per-minute cost is manageable because individual interactions are brief (30 to 90 seconds).

Internal Tools and Prototypes

Use OpenAI Realtime API for prototypes. When speed of development matters more than production cost optimization, the Realtime API is the fastest path to a working voice agent. A single WebSocket connection replaces three separate API integrations. You can have a functional prototype in a day. If the project moves to production at scale, consider migrating to LiveKit Agents for cost savings.

When to Combine Approaches

Some of our most successful deployments use multiple platforms. One client runs ElevenLabs for their premium sales line (where voice quality drives revenue) and LiveKit Agents for their support line (where cost efficiency matters at 200,000 calls per month). The LLM, knowledge base, and business logic are shared. Only the voice pipeline differs. This hybrid approach optimizes for what actually matters in each context.

Voice AI infrastructure is evolving fast. OpenAI, LiveKit, and ElevenLabs all ship major updates monthly. Whichever platform you choose, architect for swappability. Abstract your business logic from your voice pipeline so you can migrate without rewriting your agent. If you need help choosing the right architecture for your specific requirements, book a free strategy call and we will walk through the tradeoffs together.
