WebRTC Fundamentals You Need to Know
WebRTC (Web Real-Time Communication) is an open standard built into every major browser. It handles media capture, encoding, transmission, and rendering without plugins. Chrome, Firefox, Safari, and Edge all support it. React Native and Flutter have WebRTC libraries for mobile.
The core WebRTC APIs do three things. getUserMedia captures camera and microphone input. RTCPeerConnection establishes a peer-to-peer connection and handles media transport. RTCDataChannel sends arbitrary data (chat messages, file transfers) over the same peer connection that carries the media.
What WebRTC does not include, and what you need to build, is everything around the connection. Signaling (how two peers find each other and negotiate a connection), TURN relay (fallback when peer-to-peer fails), group call routing (WebRTC is inherently peer-to-peer), recording, and quality monitoring all sit outside the WebRTC specification.
This is where the complexity and cost live. A basic 1-on-1 video call using WebRTC takes a skilled developer 2 to 3 weeks. A group video platform with recording and screen sharing takes 4 to 6 months. Understanding these layers helps you scope the project correctly.
Signaling: Connecting Two Peers
Before two browsers can exchange video, they need to exchange connection metadata: which codecs they support, their candidate network addresses (ICE candidates), and the DTLS certificate fingerprints used to set up encryption. This exchange is called signaling, and WebRTC deliberately leaves its transport unspecified so you can implement it however you want.
How Signaling Works
Peer A creates an "offer" (SDP: Session Description Protocol) describing its media capabilities. The offer travels through your signaling server to Peer B. Peer B creates an "answer" with its own capabilities. Both peers exchange ICE candidates (network addresses) to find the best connection path. Once they agree on a path, media flows directly between them.
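The exchange above can be sketched as a small set of message shapes plus a check that messages arrive in a sensible order. This is a minimal sketch: the message names, fields, and the `nextState` helper are illustrative assumptions, since WebRTC defines only the SDP and ICE candidate payloads and not the envelope around them.

```typescript
// Hypothetical signaling message shapes; WebRTC does not define these.
type SignalMessage =
  | { kind: "offer"; from: string; to: string; sdp: string }
  | { kind: "answer"; from: string; to: string; sdp: string }
  | { kind: "candidate"; from: string; to: string; candidate: string };

// Negotiation progress as seen by the signaling layer.
type NegotiationState = "idle" | "offered" | "answered";

// Returns the state after applying a message, or throws if the message
// arrives out of order (for example, an answer with no preceding offer).
function nextState(state: NegotiationState, msg: SignalMessage): NegotiationState {
  switch (msg.kind) {
    case "offer":
      if (state !== "idle") throw new Error("offer already sent");
      return "offered";
    case "answer":
      if (state !== "offered") throw new Error("answer without offer");
      return "answered";
    case "candidate":
      // ICE candidates may trickle in at any point once an offer exists.
      if (state === "idle") throw new Error("candidate before offer");
      return state;
  }
}
```

Rejecting out-of-order messages early tends to make reconnection bugs easier to spot, because a stale answer from a previous session fails loudly instead of corrupting the new negotiation.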
Signaling Server Implementation
WebSocket is the most common transport for signaling. A Node.js server with Socket.IO handles signaling for most applications. For larger deployments, use Redis pub/sub behind multiple WebSocket servers for horizontal scaling. The signaling server handles room management (who is in which call), presence (who is online), and call state (ringing, connected, ended).
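The room management part can be sketched independently of the transport. The example below is a minimal in-memory registry, assuming a hypothetical `Send` callback that stands in for a WebSocket write; the class and method names are illustrative and not taken from any library.

```typescript
// `Send` stands in for writing to a peer's WebSocket (an assumption).
type Send = (payload: string) => void;

class RoomRegistry {
  private rooms = new Map<string, Map<string, Send>>();

  // Adds a peer to a room and returns the IDs of peers already present,
  // so the newcomer knows whom to send offers to.
  join(roomId: string, peerId: string, send: Send): string[] {
    const room = this.rooms.get(roomId) ?? new Map<string, Send>();
    this.rooms.set(roomId, room);
    const existing = Array.from(room.keys());
    room.set(peerId, send);
    return existing;
  }

  // Removes a peer and drops the room once it is empty.
  leave(roomId: string, peerId: string): void {
    const room = this.rooms.get(roomId);
    if (!room) return;
    room.delete(peerId);
    if (room.size === 0) this.rooms.delete(roomId);
  }

  // Forwards a payload to one peer; returns false if the peer is not in the room.
  relay(roomId: string, toPeerId: string, payload: string): boolean {
    const send = this.rooms.get(roomId)?.get(toPeerId);
    if (!send) return false;
    send(payload);
    return true;
  }
}
```

In a multi-server deployment this map would live behind a shared store such as the Redis pub/sub layer mentioned above, since a single process only sees its own sockets.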
Budget $10K to $25K for a production signaling server with room management, presence, and reconnection handling. The server itself is lightweight (100 concurrent calls use minimal CPU and bandwidth), but the edge cases around reconnection, network changes, and browser compatibility require careful engineering.
STUN and TURN Servers
STUN servers help peers discover their public IP addresses. Google provides free STUN servers (stun.l.google.com:19302), which work for development. TURN servers relay media when peer-to-peer connections fail (roughly 10 to 15% of connections, higher in corporate networks). Self-host coturn on AWS ($50 to $200/month per server) or use Twilio Network Traversal ($0.0004 per relay minute). Deploy TURN servers in at least 3 regions for global coverage.
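A minimal sketch of how the STUN and TURN entries might be assembled for a client. The TURN hostnames, username, and credential below are placeholders, and `IceServerConfig` is a local type that mirrors the shape of the browser's `iceServers` entries rather than the browser type itself.

```typescript
// Mirrors the shape of entries in RTCPeerConnection's iceServers option.
interface IceServerConfig {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// Builds an ICE server list: one public STUN server plus one TURN entry
// per region. Hostnames and credentials are placeholders.
function buildIceServers(
  turnHosts: string[],
  username: string,
  credential: string,
): IceServerConfig[] {
  const stun: IceServerConfig = { urls: "stun:stun.l.google.com:19302" };
  const turn = turnHosts.map((host): IceServerConfig => ({
    // Offer UDP first, with TCP as a fallback for restrictive firewalls.
    urls: [`turn:${host}:3478?transport=udp`, `turn:${host}:3478?transport=tcp`],
    username,
    credential,
  }));
  return [stun, ...turn];
}

const iceServers = buildIceServers(
  ["turn-us.example.com", "turn-eu.example.com", "turn-ap.example.com"],
  "demo-user",
  "demo-secret",
);
// In a browser this would be passed as: new RTCPeerConnection({ iceServers })
```

Production setups usually issue short-lived TURN credentials from the backend rather than shipping a static secret to the client; the static values here are only for illustration.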
Group Calls: SFU Architecture
Peer-to-peer WebRTC works for 1-on-1 calls. For group calls with more than 3 or 4 participants, you need a Selective Forwarding Unit (SFU).
Why Peer-to-Peer Breaks for Groups
In a peer-to-peer mesh, every participant sends their video to every other participant. With 4 people, each person uploads 3 streams and downloads 3 streams. With 10 people, that is 9 uploads and 9 downloads per person. Most consumer internet connections cannot handle more than 3 to 4 simultaneous uploads of HD video.
How an SFU Works
Each participant sends one video stream to the SFU server. The SFU selectively forwards each stream to other participants based on their bandwidth, viewport size, and the current speaker. A participant viewing a gallery of 9 small tiles receives low-resolution streams. A participant viewing the active speaker receives one high-resolution stream and 8 low-resolution thumbnails.
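The difference between mesh and SFU topologies can be put into numbers. The sketch below counts streams per participant in each model; the 2.5 Mbps figure for one HD stream is an assumed value for illustration, not a measurement.

```typescript
interface StreamLoad {
  uploads: number;
  downloads: number;
}

// Full mesh: every participant sends to and receives from every other one.
function meshLoad(participants: number): StreamLoad {
  const others = Math.max(participants - 1, 0);
  return { uploads: others, downloads: others };
}

// SFU: one upload to the server, one download per other participant.
function sfuLoad(participants: number): StreamLoad {
  const others = Math.max(participants - 1, 0);
  return { uploads: Math.min(others, 1), downloads: others };
}

// Upstream bandwidth needed per participant, in Mbps.
function uploadMbps(load: StreamLoad, mbpsPerStream: number): number {
  return load.uploads * mbpsPerStream;
}

const HD_MBPS = 2.5; // assumed bitrate of one HD stream
const meshUpload = uploadMbps(meshLoad(10), HD_MBPS); // 22.5 Mbps
const sfuUpload = uploadMbps(sfuLoad(10), HD_MBPS);   // 2.5 Mbps
```

Download counts are the same in both models, which is why the SFU's selective forwarding of lower-resolution streams matters as much as the reduced upload.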
SFU Options
- mediasoup: Open-source, with a Node.js API on top of C++ media workers. One of the most widely used self-hosted SFUs. Excellent performance (handles 100+ participants per server). Active community. Budget $15K to $30K for integration and deployment.
- LiveKit: Open-source with a managed cloud option. Built in Go for high performance. Includes built-in recording, egress, and ingress. Managed cloud starts at $0.006 per participant-minute. Self-hosted is free. Budget $10K to $25K for integration.
- Janus: Open-source, C-based. Very performant but harder to extend. Best for teams with C/C++ expertise.
- Ion-SFU: Go-based, lightweight. Good for simple group calling without advanced features.
Our recommendation: LiveKit for most projects. It bundles SFU, recording, and streaming in one package, has excellent SDKs for web, React Native, Flutter, and native mobile, and offers both self-hosted and managed options. For details on building real-time features, check our dedicated guide.
Essential Features and Build Cost
Here is the feature set for a production video calling app and what each one costs to build:
Screen Sharing: $8K to $15K
Uses the getDisplayMedia browser API, which works well on desktop browsers. Mobile screen sharing requires platform-specific implementations (a ReplayKit Broadcast Upload Extension on iOS, MediaProjection on Android). Add annotation (drawing on the shared screen) for another $10K to $15K using a canvas overlay synced via the data channel.
Recording: $20K to $45K
Two approaches: composite recording (single video mixing all participants, like a Zoom recording) or individual track recording (separate files per participant for post-production). LiveKit Egress handles both. Self-hosted recording uses FFmpeg pipelines on GPU-enabled instances. Storage on S3 at $0.023/GB. Transcoding to multiple resolutions adds processing cost.
Chat and Reactions: $8K to $15K
In-call text chat via WebRTC DataChannel or a parallel WebSocket connection. Emoji reactions, hand raising, and polls. Persist chat history for post-call reference. These features are straightforward but important for user experience in meetings.
Virtual Backgrounds: $10K to $20K
Real-time body segmentation using TensorFlow.js (BodyPix or MediaPipe Selfie Segmentation). Runs client-side on the user's GPU. Performance varies by device; provide a fallback for low-powered devices. Blur background is simpler than image replacement and works well as a default option.
Breakout Rooms: $12K to $25K
Move participants between SFU rooms dynamically. Requires room management logic, a moderator interface, timers, and automatic return to the main room. The SFU handles media routing; your application layer manages the room assignments and transitions.
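The application-layer part of breakout rooms is mostly bookkeeping. A minimal sketch, assuming round-robin assignment and placeholder room IDs such as "breakout-1"; moving the actual media between SFU rooms is left to whichever SFU SDK is in use.

```typescript
// Distributes participants across breakout rooms round-robin.
// Room IDs such as "breakout-1" are placeholders.
function assignBreakoutRooms(
  participantIds: string[],
  roomCount: number,
): Map<string, string[]> {
  if (!Number.isInteger(roomCount) || roomCount < 1) {
    throw new Error("roomCount must be a positive integer");
  }
  const rooms = new Map<string, string[]>();
  for (let i = 0; i < roomCount; i++) rooms.set(`breakout-${i + 1}`, []);
  participantIds.forEach((id, index) => {
    const roomId = `breakout-${(index % roomCount) + 1}`;
    rooms.get(roomId)!.push(id);
  });
  return rooms;
}

// Automatic return: every participant maps back to the main room ID.
function returnToMain(
  rooms: Map<string, string[]>,
  mainRoomId: string,
): Map<string, string> {
  const moves = new Map<string, string>();
  rooms.forEach((ids) => ids.forEach((id) => moves.set(id, mainRoomId)));
  return moves;
}
```

A moderator interface would typically let the host override these assignments before they are applied, so the round-robin result is best treated as a starting suggestion.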
Waiting Room: $5K to $10K
Hold participants in a pre-call state until the host admits them. Important for security in healthcare, education, and business contexts. Simple to implement but requires careful UX design for the host's admit/deny interface.
Quality Monitoring and Optimization
Video quality problems are the number one complaint in video calling apps. Proactive monitoring prevents user frustration.
WebRTC Statistics API
RTCPeerConnection.getStats() provides real-time metrics: bitrate, packet loss, jitter, round-trip time, frame rate, and resolution. Collect these metrics every 2 to 5 seconds and send them to your analytics backend. Build dashboards that show call quality across your user base, broken down by browser, device, network type, and region.
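Most getStats() counters are cumulative, so useful metrics come from the difference between two samples. The sketch below assumes a simplified `InboundSample` type holding a subset of the fields found on an "inbound-rtp" stats entry; extracting those fields from the real stats report is left out.

```typescript
// Simplified subset of the fields on an "inbound-rtp" stats entry.
interface InboundSample {
  timestamp: number; // milliseconds
  bytesReceived: number;
  packetsReceived: number;
  packetsLost: number;
}

interface QualitySummary {
  bitrateKbps: number;
  lossFraction: number;
}

// Derives bitrate and packet loss over the interval between two samples.
function summarize(prev: InboundSample, curr: InboundSample): QualitySummary {
  const seconds = (curr.timestamp - prev.timestamp) / 1000;
  if (seconds <= 0) throw new Error("samples must be in chronological order");
  const bits = (curr.bytesReceived - prev.bytesReceived) * 8;
  const received = curr.packetsReceived - prev.packetsReceived;
  // packetsLost can move backwards when late packets arrive; clamp at zero.
  const lost = Math.max(curr.packetsLost - prev.packetsLost, 0);
  const expected = received + lost;
  return {
    bitrateKbps: bits / seconds / 1000,
    lossFraction: expected > 0 ? lost / expected : 0,
  };
}
```

For example, 250,000 bytes and 190 packets received with 10 lost over a 2-second interval works out to 1,000 kbps and 5% loss.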
Adaptive Bitrate
WebRTC has built-in bandwidth estimation, but you should supplement it with application-level logic. When packet loss exceeds 5%, reduce video resolution. When bandwidth drops below 500 kbps, switch to audio-only and show a static avatar. When network conditions improve, gradually restore quality. Users tolerate lower resolution far better than stuttering or freezing.
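These rules can be written as a small decision function. This is a sketch using the 5% and 500 kbps thresholds from the text; the `healthyStreak` counter and its default of five consecutive good samples are assumptions added to model "gradually restore".

```typescript
type QualityAction = "audio-only" | "reduce-resolution" | "restore" | "hold";

interface NetworkSample {
  lossFraction: number; // 0.05 means 5% packet loss
  bandwidthKbps: number;
}

// Thresholds from the text: 5% packet loss and 500 kbps.
const LOSS_LIMIT = 0.05;
const AUDIO_ONLY_KBPS = 500;

// Picks an action for the latest sample. `healthyStreak` counts consecutive
// good samples so quality is restored gradually rather than on one reading.
function decide(
  sample: NetworkSample,
  healthyStreak: number,
  streakToRestore = 5,
): QualityAction {
  if (sample.bandwidthKbps < AUDIO_ONLY_KBPS) return "audio-only";
  if (sample.lossFraction > LOSS_LIMIT) return "reduce-resolution";
  return healthyStreak >= streakToRestore ? "restore" : "hold";
}
```

Requiring a streak before restoring avoids oscillating between resolutions on a connection that hovers near the threshold.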
Simulcast
Simulcast sends multiple quality layers (high, medium, low) from each participant. The SFU selects the appropriate layer for each viewer based on their bandwidth and viewport size. This is the key technique that makes group calls work on mixed-quality networks. LiveKit and mediasoup both support simulcast natively.
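The layer selection an SFU performs can be approximated with a short function. This is a simplified sketch, not how LiveKit or mediasoup implement it; the layer ladder (resolutions and bitrates) is an illustrative assumption.

```typescript
interface SimulcastLayer {
  rid: string;
  heightPx: number;
  kbps: number;
}

// Example layer ladder; the resolutions and bitrates are illustrative.
const LAYERS: SimulcastLayer[] = [
  { rid: "low", heightPx: 180, kbps: 150 },
  { rid: "medium", heightPx: 360, kbps: 500 },
  { rid: "high", heightPx: 720, kbps: 1700 },
];

// Picks the smallest layer that covers the viewer's tile height, then steps
// down until the layer fits the viewer's available bandwidth.
function selectLayer(
  availableKbps: number,
  tileHeightPx: number,
  layers: SimulcastLayer[] = LAYERS,
): SimulcastLayer {
  if (layers.length === 0) throw new Error("at least one layer is required");
  const sorted = layers.slice().sort((a, b) => a.heightPx - b.heightPx);
  // Smallest layer that covers the tile, or the largest one available.
  let index = sorted.findIndex((layer) => layer.heightPx >= tileHeightPx);
  if (index === -1) index = sorted.length - 1;
  // Step down while the layer exceeds the bandwidth budget.
  while (index > 0 && sorted[index].kbps > availableKbps) index--;
  return sorted[index];
}
```

A thumbnail tile gets the low layer even on a fast connection, while a full-size tile on a constrained connection is stepped down to whatever fits.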
Network Resilience
Handle network interruptions gracefully. WebRTC's ICE restart mechanism can recover from network changes (WiFi to cellular, IP address change) without dropping the call. Implement automatic reconnection with exponential backoff. Show clear UI indicators ("Reconnecting...") so users know the app is working to restore the connection rather than frozen.
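The retry timing for reconnection can be sketched as a pure function. The base delay of 500 ms, the 30-second cap, and the jitter scheme are assumed defaults for illustration; the actual reconnect call (for example an ICE restart or a new signaling connection) is left out.

```typescript
// Delay before reconnect attempt `attempt` (0-based). The delay doubles on
// each attempt and is capped at maxMs. `jitter` is a value in [0, 1],
// typically Math.random(), that shortens the delay by up to half so that
// many clients do not retry at the same instant.
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  maxMs = 30000,
  jitter = 0,
): number {
  const exponential = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  return exponential * (1 - 0.5 * jitter);
}

// Deterministic schedule (no jitter) for the first six attempts.
const schedule = [0, 1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt));
// 500, 1000, 2000, 4000, 8000, 16000 milliseconds
```

Jitter matters most after a server-side outage, when every connected client loses its connection at the same moment and would otherwise retry in lockstep.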
Scaling to Thousands of Concurrent Calls
A single SFU server handles 50 to 200 concurrent group calls depending on participant count and resolution. Scaling beyond that requires distributed architecture.
Horizontal SFU Scaling
Deploy SFU instances across multiple availability zones and regions. Use a routing layer that assigns calls to the nearest SFU instance based on participant locations. LiveKit's distributed architecture handles this natively. For mediasoup, you need a custom routing layer that tracks room assignments and SFU capacity.
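A custom routing layer of the kind described for mediasoup can start as a simple selection function. This is a minimal sketch: the `SfuInstance` shape, the region labels, and the "most free capacity" tie-breaker are assumptions, and a real router would also need health checks and a shared view of room assignments.

```typescript
interface SfuInstance {
  id: string;
  region: string;
  activeRooms: number;
  maxRooms: number;
}

// Picks an SFU for a new room: prefer instances in the caller's region,
// then the one with the most free capacity. Returns undefined when every
// instance is full, which is the signal to scale out.
function pickSfu(
  instances: SfuInstance[],
  preferredRegion: string,
): SfuInstance | undefined {
  const free = (s: SfuInstance) => s.maxRooms - s.activeRooms;
  const available = instances.filter((s) => free(s) > 0);
  const local = available.filter((s) => s.region === preferredRegion);
  const pool = local.length > 0 ? local : available;
  return pool.reduce<SfuInstance | undefined>(
    (best, s) => (best === undefined || free(s) > free(best) ? s : best),
    undefined,
  );
}
```

Falling back to another region when the local one is full trades some latency for availability; whether that is acceptable depends on the product.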
Geographic Distribution
Media latency is dominated by physical distance. A call between New York and London has a round-trip time of roughly 70 ms, most of which is a hard floor set by the speed of light in fiber. Deploy SFU instances in regions where your users are concentrated: US East, US West, Europe, and Asia-Pacific cover most global use cases. Use Anycast DNS or a global load balancer to route users to the nearest SFU.
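The physical floor on latency is easy to estimate. The sketch below assumes light travels at roughly 200,000 km/s in fiber and uses an approximate great-circle distance of 5,570 km between New York and London; real cable routes are longer and add switching delay, so observed round-trip times sit above this bound.

```typescript
// Light in optical fiber travels at roughly two thirds of its vacuum speed.
const FIBER_KM_PER_MS = 200; // about 200,000 km/s, an approximation

// Lower bound on round-trip time over a straight fiber path of this length.
function minRoundTripMs(distanceKm: number): number {
  return (2 * distanceKm) / FIBER_KM_PER_MS;
}

// New York to London is roughly 5,570 km along the great circle (approximate).
const nyLondonRttMs = minRoundTripMs(5570); // about 55.7 ms
```

The same calculation gives a quick sanity check when choosing regions: if two user populations are 10,000 km apart, no server placement brings their round trip below about 100 ms.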
Infrastructure Cost at Scale
- 100 concurrent calls: 2 to 3 SFU instances, $500 to $1,500/month
- 1,000 concurrent calls: 10 to 20 SFU instances, $3,000 to $10,000/month
- 10,000 concurrent calls: 50 to 100 SFU instances across regions, $15,000 to $50,000/month
These costs are for compute only. Add TURN relay ($500 to $5,000/month), recording storage and processing ($1,000 to $10,000/month), and monitoring infrastructure ($500 to $2,000/month). Plan your infrastructure strategy early, and see our guidance on scaling your app for growing users.
Timeline and Realistic Budgets
Here are three project scopes with realistic timelines and budgets:
1-on-1 Video Feature (8 to 12 weeks, $40K to $80K)
Add video calling to an existing app. WebRTC with a CPaaS provider (Daily.co or Twilio), signaling, basic UI, screen sharing, and chat. Good for telehealth, tutoring, or customer support apps.
Group Video App (16 to 24 weeks, $120K to $220K)
Standalone video calling with group calls (up to 25 participants), screen sharing, recording, chat, virtual backgrounds, and a scheduling interface. Built on LiveKit or mediasoup. Web and mobile (React Native).
Enterprise Video Platform (30 to 48 weeks, $250K to $450K)
Large meetings (100+ participants), breakout rooms, webinar mode with up to 1,000 viewers, cloud recording with transcription (using Deepgram or AssemblyAI), SSO integration, admin dashboard, analytics, and multi-region deployment.
The technology stack for video calling is mature and well-documented. The challenge is not "can we build it" but "how do we make it reliable at scale." Focus your engineering budget on quality monitoring, network resilience, and edge case handling. These are what separate a good video app from a frustrating one.
If you are building a video calling product or adding video features to an existing app, book a free strategy call with our team to scope the project.