What an Ambient AI Scribe Actually Does
An ambient AI medical scribe passively listens to the unstructured conversation between a clinician and patient, then produces a structured clinical note, billing codes, patient instructions, and in many cases an order draft, all without the clinician touching a keyboard during the visit. Unlike older dictation products like Dragon Medical One, ambient scribes require no trigger words, no template selection, and no explicit commands. The clinician opens the app, starts the encounter, and walks out with a finished draft note ready for signoff inside the EHR.
The reference products in this category are Abridge, Ambience Healthcare, Nuance DAX Copilot, Suki, DeepScribe, and Nabla. Each of these handles five core jobs: multi-speaker audio capture, medical-grade speech recognition, speaker diarization to separate clinician from patient, clinical entity extraction, and structured note synthesis in a format like SOAP, APSO, or H&P (history and physical).
The market momentum is real. Abridge crossed a $2.5B valuation in 2025. Ambience raised $70M in Series B. Nuance DAX is embedded in 500+ health systems through the Microsoft deal. Health systems are writing seven-figure annual contracts because clinician burnout is a board-level problem and ambient scribes cut documentation time by 60 to 80%.
The Reference Architecture
A production ambient scribe is a distributed system with four logical tiers. Understanding each tier is essential before you write a line of code.
The capture tier runs on a mobile device, a browser, or a dedicated room microphone and handles real-time audio streaming over a secure WebSocket or WebRTC connection. This is where quality and latency decisions are made.
The transcription tier converts audio to text using a medical-tuned automatic speech recognition (ASR) model. This is where you choose between managed vendors (Deepgram, AssemblyAI) and self-hosted open-source (Whisper).
The understanding tier runs medical named entity recognition, diarization alignment, and section classification. This is where you turn raw text into structured clinical facts.
The generation tier uses a large language model to produce the final structured note, billing codes, and patient summary. This is where most of your LLM budget goes.
On the backend you need a HIPAA-eligible cloud provider (AWS, Google Cloud, or Azure) with a signed Business Associate Agreement in place. Your storage layer needs envelope encryption, your compute layer needs VPC isolation, and your data pipeline needs full audit logging of every PHI touch point.
The same architectural principles apply whether you are building a scribe or any other clinical tool. We covered the foundational patterns in our earlier guide on how to build a healthcare app.
Audio Capture and Preprocessing
Capture quality determines everything downstream. A two-speaker exam room is very different from a noisy emergency department bay or a hospital corridor, and your product needs to handle all of them.
Mobile capture. On iOS, use AVAudioEngine with voice processing enabled. On Android, use the Oboe library with the VOICE_COMMUNICATION audio source. Both platforms give you echo cancellation, automatic gain control, and noise suppression at the OS level.
Browser capture. Request the microphone with getUserMedia, enabling the echoCancellation, noiseSuppression, and autoGainControl constraints, then process the stream with the Web Audio API. Chrome and Safari handle this well. Firefox is less consistent.
Audio format. Sample at 16 kHz mono for speech. Use Opus encoding at 24 kbps for streaming to keep bandwidth low and quality high. Chunk audio every 100 to 250 milliseconds for low end-to-end latency.
Preprocessing. Run a voice activity detector like Silero VAD or WebRTC VAD to trim silence before you pay for transcription. Apply a light noise suppressor such as RNNoise or NVIDIA Maxine for environments with HVAC or hallway noise. Never apply heavy compression or aggressive gating because it destroys the quiet background speech you need for diarization.
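To make the trim-before-you-pay step concrete, here is a minimal energy-based silence trimmer. It is a crude stand-in for Silero VAD or WebRTC VAD, not a replacement: the frame size and threshold are illustrative assumptions, and a real VAD models speech probability rather than raw energy.

```python
import math
import struct

FRAME_MS = 30          # frame size in the style of WebRTC VAD
SAMPLE_RATE = 16000    # 16 kHz mono, as recommended above
BYTES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit PCM

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of one 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def trim_silence(pcm: bytes, threshold: float = 500.0) -> bytes:
    """Keep only frames whose energy exceeds the (illustrative) threshold,
    so silent stretches never reach the paid transcription tier."""
    voiced = []
    for i in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
        frame = pcm[i:i + BYTES_PER_FRAME]
        if frame_rms(frame) >= threshold:
            voiced.append(frame)
    return b"".join(voiced)
```

In production you would swap `frame_rms` for a proper VAD's speech-probability score and add hangover frames so you never clip the starts of quiet utterances.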
Session management. Handle long encounters (30+ minutes) without memory issues. Split audio into rolling buffers that get uploaded continuously. Never wait for the end of the visit to start uploading.
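The rolling-buffer idea can be sketched as a small accumulator that flushes fixed-size segments the moment they are full. The `flush` callback is a hypothetical upload hook, and the 10-second segment size is an assumption, not a recommendation from any vendor.

```python
from typing import Callable

class RollingUploader:
    """Accumulate streamed audio and flush fixed-size segments immediately,
    so a 30+ minute encounter never sits in memory waiting for visit end."""

    def __init__(self, flush: Callable[[int, bytes], None],
                 segment_bytes: int = 16000 * 2 * 10):  # ~10 s of 16-bit 16 kHz
        self._flush = flush            # hypothetical upload callback (seq, payload)
        self._segment_bytes = segment_bytes
        self._buf = bytearray()
        self._seq = 0

    def feed(self, chunk: bytes) -> None:
        """Append a capture chunk; flush every full segment right away."""
        self._buf.extend(chunk)
        while len(self._buf) >= self._segment_bytes:
            payload = bytes(self._buf[:self._segment_bytes])
            del self._buf[:self._segment_bytes]
            self._flush(self._seq, payload)
            self._seq += 1

    def close(self) -> None:
        """Flush whatever remains when the encounter ends."""
        if self._buf:
            self._flush(self._seq, bytes(self._buf))
            self._buf.clear()
```

Sequence numbers let the backend reassemble segments in order even when retries deliver them out of order.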
Choosing Your Speech-to-Text Engine
This is the most important vendor decision you will make, because ASR accuracy on medical terminology drives everything downstream.
Deepgram Nova-3 Medical. The leading streaming option. Word error rate below 4% on medical dictation, native support for 50,000+ medical terms, built-in diarization, HIPAA BAA available. Roughly $0.004 to $0.01 per minute. This is the easiest to integrate for real-time ambient use cases.
AssemblyAI Universal-2. A strong alternative with excellent punctuation and formatting, a solid medical vocabulary boost feature, and very good speaker diarization. Slightly slower in streaming mode but produces cleaner batch transcripts.
OpenAI Whisper Large v3 (self-hosted). The best open-source option. You can self-host on H100 or A100 GPUs, fine-tune on your own medical corpus, and eliminate per-minute cost at scale. Tradeoff: you now operate GPU infrastructure under HIPAA. Break-even vs managed vendors is typically around 50,000 hours of audio per month.
Google Cloud Speech-to-Text Medical Conversation. Purpose-built for clinical dialog with excellent diarization and direct integration with Google Healthcare API and MedLM. Natural choice if you are already on Google Cloud.
Our recommendation. For most teams shipping a first product, start with Deepgram for streaming and keep a Whisper fallback for batch reprocessing of uncertain segments. Migrate pieces to self-hosted Whisper as volume grows.
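As a sketch of what the Deepgram streaming path looks like, the helper below assembles the wss:// URL for the live transcription endpoint. The query parameter names follow Deepgram's documented options, but verify them (and the model name) against the current API reference before relying on them; the connection itself would then be opened with any WebSocket client, sending your Opus chunks as binary frames.

```python
from urllib.parse import urlencode

def deepgram_live_url(model: str = "nova-3-medical") -> str:
    """Build the URL for Deepgram's live transcription WebSocket.
    Parameter names are believed to match Deepgram's query options;
    check the vendor docs before shipping."""
    params = {
        "model": model,
        "encoding": "opus",         # matches the Opus stream from capture
        "sample_rate": 16000,
        "diarize": "true",          # built-in speaker diarization
        "interim_results": "true",  # low-latency partial transcripts
        "smart_format": "true",
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)
```

Keeping the URL construction in one place also makes the Whisper fallback easier: the batch reprocessor can reuse the same session metadata while pointing at your own inference endpoint.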
Diarization, NER, and Clinical Understanding
Raw transcripts are worthless if you cannot tell who said what. Diarization segments audio by speaker. Role assignment labels which speaker is the clinician and which is the patient. Both are required.
Diarization models. Pyannote 3.1 is the current open-source gold standard and runs well on CPU for short encounters or on a T4 GPU for real-time use. NVIDIA NeMo Sortformer is another strong option. Deepgram and AssemblyAI include diarization in their managed offerings.
Role assignment. Audio alone cannot reliably tell the clinician from the patient. Train a lightweight classifier (a fine-tuned BERT model on a few thousand labeled segments) that scores each segment against clinician and patient language patterns. Features: medical vocabulary density, question vs answer structure, first-person vs second-person framing. Accuracy above 95% is achievable.
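The feature intuition above can be shown with a toy scorer. This is a hedged stand-in for the fine-tuned BERT classifier, using a tiny hand-picked vocabulary and invented weights purely for illustration; the real model learns these signals from labeled segments.

```python
MEDICAL_TERMS = {"hypertension", "lisinopril", "auscultation", "bilateral",
                 "mg", "prn", "tachycardia", "edema"}  # toy vocabulary

def clinician_score(segment: str) -> float:
    """Score one diarized segment: higher means more clinician-like.
    Weights and features are illustrative, not learned."""
    words = segment.lower().replace("?", " ?").split()
    if not words:
        return 0.0
    vocab_density = sum(w in MEDICAL_TERMS for w in words) / len(words)
    asks_question = 1.0 if segment.strip().endswith("?") else 0.0
    second_person = sum(w in {"you", "your"} for w in words) / len(words)
    # Clinicians ask questions, address the patient, and use dense jargon.
    return 2.0 * vocab_density + 0.5 * asks_question + 1.0 * second_person

def assign_roles(speaker_a: list[str], speaker_b: list[str]) -> tuple[str, str]:
    """Label the higher-scoring diarized speaker as the clinician."""
    a = sum(map(clinician_score, speaker_a))
    b = sum(map(clinician_score, speaker_b))
    return ("clinician", "patient") if a >= b else ("patient", "clinician")
```

Even this crude version separates "Any edema in your ankles?" from "My ankles have been swollen," which is why a small trained classifier clears 95% so comfortably.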
Multi-speaker encounters. For encounters with family members, translators, or a resident plus attending, build an explicit enrollment flow where the clinician taps to identify themselves at the start of the visit. Every production scribe does this.
Medical NER. Your options are AWS Comprehend Medical, Google Healthcare Natural Language API, or direct LLM extraction with structured output (Claude Sonnet 4.5 or GPT-4o with JSON mode). The LLM path is more accurate on edge cases and handles novel terminology better but costs more per encounter.
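For the LLM extraction path, the contract that matters is the output schema. Below is a hedged sketch of a JSON Schema you might hand to a structured-output mode; the field names and category list are illustrative choices, not a vendor-defined format.

```python
import json

# Illustrative JSON Schema for clinical entity extraction via an LLM's
# structured-output mode. Categories and field names are assumptions.
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},      # verbatim transcript span
                    "category": {
                        "type": "string",
                        "enum": ["condition", "medication", "procedure",
                                 "lab", "anatomy", "dosage"],
                    },
                    "negated": {"type": "boolean"},  # e.g. "denies chest pain"
                    "speaker": {"type": "string",
                                "enum": ["clinician", "patient"]},
                },
                "required": ["text", "category", "negated", "speaker"],
            },
        }
    },
    "required": ["entities"],
}
```

Requiring a verbatim `text` span is deliberate: it keeps every extracted entity traceable to the transcript, which the grounding checks later in the pipeline depend on.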
Concept linking. Map extracted strings to canonical codes using SNOMED CT, RxNorm, LOINC, and ICD-10. Tools like Amazon HealthLake, Particle Health, or the open-source MedCAT library handle this layer. This step is what enables billing codes and analytics downstream.
SOAP Note Generation with LLMs
Generating the final note is where large language models earn their keep. Your prompt needs to take the diarized transcript, the extracted entities, the patient chart context, and the specialty-specific template, then produce a clinically accurate note in the clinician's preferred style.
Model choice. Claude Sonnet 4.5, GPT-4o, or Gemini 2.5 Pro for production quality. All three offer HIPAA-eligible endpoints from their respective clouds. Open-source options like Llama 3.3 70B Instruct fine-tuned on MIMIC data are viable if you have the MLOps capacity to self-host.
Prompt structure. Always include the raw transcript as the source of truth, a specialty template (primary care, cardiology, orthopedics, etc.), the patient prior problem list, and explicit instructions to never invent facts not present in the transcript.
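Those four inputs can be assembled into a single prompt like this. A minimal sketch: the wording of the rules and the section labels are assumptions, and production prompts add specialty few-shot examples and per-clinician style preferences on top.

```python
def build_note_prompt(transcript: str, template: str,
                      problem_list: list[str]) -> str:
    """Assemble the note-generation prompt with the transcript as the sole
    source of truth and an explicit no-invention instruction."""
    problems = "\n".join(f"- {p}" for p in problem_list) or "- (none on file)"
    return (
        "You are drafting a clinical note for physician review.\n"
        f"Specialty template:\n{template}\n\n"
        f"Prior problem list:\n{problems}\n\n"
        f"Diarized transcript (sole source of truth):\n{transcript}\n\n"
        "Rules: use only facts stated in the transcript above. "
        "If a template section has no supporting content, write "
        "'Not discussed'. Never invent vitals, medications, or findings."
    )
```

Making "Not discussed" an explicit instruction matters: without an allowed escape hatch, models tend to fill empty sections with plausible-sounding fabrications.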
Grounding and hallucination control. Every generated claim should be traceable to a specific transcript span. Implement a verification pass where a second smaller model checks each sentence in the generated note against the source transcript and flags anything unsupported for clinician review. Hallucination is an existential risk for any clinical product.
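A crude lexical version of that verification pass looks like this: flag any generated sentence whose content words are mostly absent from the transcript. The stop-word list and 0.6 threshold are illustrative; the real check the text describes uses a second model (LLM or NLI) rather than token overlap.

```python
def sentence_supported(sentence: str, transcript: str,
                       threshold: float = 0.6) -> bool:
    """True if enough of the sentence's content words appear in the
    transcript. A lexical stand-in for a model-based verifier."""
    stop = {"the", "a", "an", "of", "and", "is", "was", "to", "in", "with"}
    tokens = {w.strip(".,").lower() for w in sentence.split()} - stop
    source = {w.strip(".,").lower() for w in transcript.split()}
    if not tokens:
        return True
    return len(tokens & source) / len(tokens) >= threshold

def flag_unsupported(note_sentences: list[str], transcript: str) -> list[str]:
    """Return sentences that need clinician review before sign-off."""
    return [s for s in note_sentences if not sentence_supported(s, transcript)]
```

Even this weak check catches whole-cloth inventions (a medication never mentioned); the model-based pass is needed for subtler failures like negation flips and wrong laterality.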
Section-by-section generation. A typical SOAP note has Subjective, Objective, Assessment, and Plan sections, and each benefits from its own focused prompt with specialty-specific few-shot examples. This approach also enables streaming the note to the UI as each section completes, which clinicians strongly prefer over waiting for a monolithic result.
Specialty templates. Primary care, behavioral health, dermatology, and urgent care all have different note structures. Build a template library and let customers customize per specialty. This is a moat: the more templates you ship, the harder it is for competitors to match you.
If you are also supporting voice-driven workflows inside the scribe (asking the AI to explain a term or summarize a prior visit), check our deeper guide on voice AI applications for the conversational layer patterns.
EHR Integration: The Hardest Part
No scribe succeeds without seamless integration into the electronic health record. Clinicians will not copy and paste notes, and health systems will not buy anything that creates a second documentation system.
Epic Showroom (successor to the retired App Orchard program). The path for any health system running Epic (roughly 40% of the US market by patient volume). You need to become an Epic partner, pass their security and interoperability review, and build against the Epic FHIR APIs plus proprietary HL7 interfaces for note writeback. Timeline: 6 to 9 months from partnership to first live site. Budget $30K to $80K in partner fees.
Oracle Cerner Code Program. The equivalent for Cerner sites (now under Oracle Health). Cerner has a more open FHIR surface than Epic but still requires a formal partnership for production deployment.
Athenahealth Marketplace. Covers the ambulatory and small-practice market with a much faster onboarding cycle (4 to 8 weeks). Athena has solid FHIR R4 coverage and a public developer portal.
Aggregators (Redox, Particle Health, Health Gorilla). Let you ship a single integration that works across dozens of EHRs. Almost always the right starting point for a new scribe because it lets you focus on product instead of per-EHR plumbing. Add direct Epic and Cerner integrations once you have paying customers demanding them.
Note writeback target. Almost always an unsigned draft that lands in the clinician's inbox for review and signature. Direct signing creates liability concerns that most health systems prefer to avoid.
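In FHIR terms, the unsigned-draft target usually maps to a DocumentReference whose docStatus is "preliminary". The sketch below builds a minimal R4 payload under that assumption; each EHR layers its own required identifiers, note types, and extensions on top, so treat the field set as illustrative.

```python
import base64

def draft_note_resource(note_text: str, patient_id: str,
                        encounter_id: str) -> dict:
    """Build a minimal FHIR R4 DocumentReference carrying the note as an
    unsigned draft for the clinician's in-basket. Illustrative field set."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "docStatus": "preliminary",   # draft: awaits clinician signature
        "subject": {"reference": f"Patient/{patient_id}"},
        "context": {"encounter": [{"reference": f"Encounter/{encounter_id}"}]},
        "content": [{
            "attachment": {
                "contentType": "text/plain",
                "data": base64.b64encode(note_text.encode()).decode(),
            }
        }],
    }
```

When the clinician signs in the EHR, the EHR (not your system) transitions the document to final, which keeps the liability boundary where health systems want it.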
Many of the same integration patterns apply to remote care products. We covered them in depth in our guide on how to build a telemedicine app.
HIPAA, Evaluation, and Production Readiness
HIPAA is the floor, not the ceiling. You need Business Associate Agreements with every subprocessor that touches PHI: your cloud provider, your ASR vendor, your LLM vendor, and your observability stack. AWS, Google Cloud, Azure, Anthropic, OpenAI via Azure, Deepgram, and AssemblyAI all offer BAAs. Smaller vendors often do not, and using one without a BAA is a breach waiting to happen.
Technical safeguards. TLS 1.3 in transit, AES-256 at rest with customer-managed keys, strict access controls with short-lived credentials, full audit logging of every PHI access, automated key rotation, secure deletion workflows, and least-privilege everywhere. Never let a developer laptop touch production PHI.
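The audit-logging requirement can be made mechanical with a decorator so no PHI-touching code path can forget it. A minimal in-process sketch: the list sink and the example `read_note` function are hypothetical, and a production system writes to an append-only, tamper-evident store instead.

```python
import functools
import json
import time

def audit_phi_access(log_sink: list):
    """Decorator recording who touched which patient's data, and when."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(actor: str, patient_id: str, *args, **kwargs):
            log_sink.append(json.dumps({
                "ts": time.time(),
                "actor": actor,
                "patient_id": patient_id,
                "action": fn.__name__,
            }))
            return fn(actor, patient_id, *args, **kwargs)
        return wrapper
    return decorator

audit_log: list[str] = []

@audit_phi_access(audit_log)
def read_note(actor: str, patient_id: str) -> str:
    """Example PHI-touching operation (hypothetical)."""
    return f"note for {patient_id}"
```

Enforcing the decorator via code review or a lint rule turns "full audit logging of every PHI touch point" from a policy into a property of the codebase.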
Evaluation harness. Build a golden set of at least 500 labeled encounters across your target specialties, with human-written reference notes, and run every model change against it before deployment. Track word error rate on transcription, F1 on entity extraction, ROUGE and BERTScore on generated notes, and a clinical accuracy score from a panel of reviewing physicians. The physician score is the only one that actually matters to customers.
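Of the metrics above, word error rate is the one worth implementing yourself rather than pulling in a dependency. It is word-level edit distance divided by reference length:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

One caveat for medical evaluation: normalize number and unit formatting ("20 mg" vs "twenty milligrams") before scoring, or your WER will punish harmless formatting differences as heavily as a swapped drug name.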
Production monitoring. Log every encounter with full traceability from audio to transcript to entities to note. Monitor latency percentiles at every stage. Track hallucination rate with automated fact-checking. Give clinicians a one-tap feedback mechanism on every note and feed low-scoring cases back into your evaluation set and fine-tuning pipeline.
Cost model. A 20-minute encounter costs roughly $0.18 for streaming ASR, $0.60 to $1.20 for LLM note generation, and $0.05 for storage and infrastructure, for a blended cost of around $0.85 to $1.45 per encounter. At 20 encounters per clinician per day and 250 working days per year, that is about $4,200 to $7,200 per clinician per year in raw cost.
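The arithmetic behind those figures is worth keeping as a parameterized function so you can re-run it as vendor prices move. Default rates below are the midpoints of the ranges quoted above, not vendor quotes:

```python
def annual_cost_per_clinician(minutes_per_encounter: int = 20,
                              encounters_per_day: int = 20,
                              working_days: int = 250,
                              asr_per_min: float = 0.009,
                              llm_per_encounter: float = 0.90,
                              infra_per_encounter: float = 0.05) -> float:
    """Blended raw cost per clinician per year. Defaults are midpoints
    of the ranges in the text, purely for illustration."""
    per_encounter = (minutes_per_encounter * asr_per_min   # streaming ASR
                     + llm_per_encounter                   # note generation
                     + infra_per_encounter)                # storage + infra
    return per_encounter * encounters_per_day * working_days
```

At the midpoint rates this lands around $5,650 per clinician per year, squarely inside the $4,200 to $7,200 range, which is why per-seat pricing in the low five figures still leaves healthy margin.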
Team to ship a v1. 2 ML engineers, 1 clinical informaticist (MD/NP, often part-time), 2 backend and integration engineers, 1 mobile engineer, 1 frontend engineer, 1 fractional CISO, and 1 product manager. Blended burn: $250K to $400K per month. Time to a credible MVP: 4 to 7 months.
Building an ambient AI scribe is one of the highest-leverage projects in healthcare right now, and it is also one of the most demanding. You are combining real-time audio, state-of-the-art speech recognition, clinical NLP, large language models, EHR integration, and healthcare-grade security into a single product that clinicians will use during the most important moments of their day.
If you are planning your own ambient AI scribe roadmap and want an experienced partner to accelerate architecture, vendor selection, and go-to-market, book a free strategy call and we will help you chart the fastest path to a production deployment.
Need help building this?
Our team has launched 50+ products for startups and ambitious brands. Let's talk about your project.