Technology · 13 min read

How to Handle a Production Outage Without Losing Your Users

Every production system goes down eventually, including Google's and AWS's. What separates teams that keep users through an outage from teams that don't is not luck; it is process.

Nate Laquis, Founder & CEO

Outages Are Inevitable. Your Response Is Not.

In December 2021, AWS us-east-1 went down and took half the internet with it. In June 2019, Google Cloud suffered a network outage affecting multiple regions for over four hours. Cloudflare has had multiple global outages. If companies with hundreds of SREs and billions in infrastructure budgets still go down, your startup will too.

The question is never whether you will have a production outage. The question is whether you have a plan when it happens at 2am on a Saturday. Teams without a plan spend the first thirty minutes in chaos: wrong people paged, no one sure who is in charge, customers discovering the problem from Twitter before your team does. Teams with a plan are running diagnostics within five minutes and have a customer-facing update posted within fifteen.

User trust survives outages. What users do not forgive is silence. A company that communicates clearly during an incident, resolves it, and publishes a transparent post-mortem actually comes out with stronger trust than before the outage. That outcome requires a documented incident response process, built before anything breaks.


Phase 1: Detection (Minutes 0 to 5)

The faster you know about an outage, the faster you can fix it. Ideally, your monitoring system tells you before any user does. In practice, many startups discover outages when a customer tweets. The goal of Phase 1 is to close that gap.

Automated Monitoring as the First Line of Defense

Every production application needs three layers of monitoring:

  • Synthetic uptime checks (Pingdom, Better Uptime, or AWS CloudWatch Synthetics) hit your endpoints every 60 seconds and alert within 2 minutes of a failure. These cost $20 to $50 per month and are the most valuable monitoring you can buy.
  • Real user monitoring (Datadog RUM, New Relic Browser, or Sentry Performance) reports on actual user experience, including JavaScript errors, slow page loads, and failed API calls.
  • Infrastructure monitoring (Datadog, Grafana Cloud, or AWS CloudWatch) tracks CPU, memory, disk, and network at the server level so you can spot resource exhaustion before it becomes an outage.
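As an illustration of what a synthetic check does under the hood, here is a minimal sketch using only Python's standard library. The health rule (any 2xx/3xx counts as up) and the timeout are assumptions for the sketch, not anything a specific vendor prescribes.

```python
import time
import urllib.error
import urllib.request

def is_healthy(status):
    """Treat any 2xx/3xx response as up; None means timeout or connect failure."""
    return status is not None and 200 <= status < 400

def check_endpoint(url, timeout=10.0):
    """Hit an endpoint once; report HTTP status, latency, and health."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server responded, but with an error status
    except (urllib.error.URLError, TimeoutError):
        status = None  # DNS failure, refused connection, or timeout
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"url": url, "status": status,
            "latency_ms": latency_ms, "healthy": is_healthy(status)}
```

A real setup runs this every 60 seconds from multiple regions and feeds failures into the alerting pipeline instead of logging locally.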

Your alerting pipeline matters as much as your monitoring tools. Alerts should go to PagerDuty or OpsGenie, which handle on-call rotation, escalation, and acknowledgment. Sending alerts only to Slack is a recipe for missed pages: Slack notifications are easily buried, and no one has clear accountability for responding.
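To make the pipeline concrete, here is a sketch of the payload your monitoring system would send to PagerDuty. The shape follows PagerDuty's Events API v2 (`event_action`, `payload.summary/source/severity`); the routing key and summary text are placeholders.

```python
import json

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build a PagerDuty Events API v2 trigger payload as a JSON string.

    routing_key is the integration key from your PagerDuty service.
    severity must be one of: critical, error, warning, info.
    """
    if severity not in {"critical", "error", "warning", "info"}:
        raise ValueError(f"invalid severity: {severity}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    return json.dumps(event)

# POST this JSON to PAGERDUTY_EVENTS_URL with Content-Type: application/json.
```

PagerDuty then handles deduplication, the on-call rotation, and escalation if the page is not acknowledged.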

Severity Classification

Not every alert is a P1. Define severity levels before anything breaks so your on-call engineer is not making judgment calls at 3am.

  • P1 (Critical): Production is down or completely unusable for all users. Core functionality is broken. Data loss is occurring or possible. Response time: 15 minutes or less. Page the on-call engineer and their backup immediately.
  • P2 (High): Major feature is broken, a significant portion of users are affected, or performance is severely degraded. Response time: 1 hour. Page the on-call engineer.
  • P3 (Medium): Non-critical feature broken, minor performance degradation, or an issue affecting a small subset of users. Response time: next business day. File a ticket.
  • P4 (Low): Cosmetic issues, minor bugs, or very low-impact problems. Response time: planned sprint. File a ticket.
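The policy above can be encoded so the on-call engineer is not interpreting prose at 3am. The boolean symptom flags here are simplified stand-ins for whatever signals your monitoring actually exposes.

```python
# Response-time targets from the severity definitions above.
RESPONSE_TARGETS = {
    "P1": "15 minutes",
    "P2": "1 hour",
    "P3": "next business day",
    "P4": "planned sprint",
}

def classify(all_users_down, major_feature_broken, minor_issue):
    """Map incident symptoms to a severity level per the policy above."""
    if all_users_down:
        return "P1"
    if major_feature_broken:
        return "P2"
    if minor_issue:
        return "P3"
    return "P4"

def response_action(severity):
    """P1 and P2 page the on-call engineer; P3 and P4 become tickets."""
    return "page" if severity in ("P1", "P2") else "ticket"
```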

Appointing an Incident Commander

The single most important structural decision in incident response is designating one person as Incident Commander (IC) for each incident. The IC is not necessarily the most senior engineer or the best debugger. The IC is the person responsible for coordinating the response, managing communication, and tracking time. They do not need to write code during the incident. Their job is to keep the team focused and moving. Without an IC, every engineer goes heads-down on their own theory, communication stops, and the incident drags on twice as long as it needs to.

Phase 2: Communication (Minutes 5 to 15)

While your engineers are beginning to diagnose the problem, communication needs to start immediately, in parallel. The instinct to wait until you have answers is wrong. Users want to know you know. Silence is the worst possible response to an outage.


Internal Communication: Your Incident Slack Channel

Create a dedicated Slack channel for the incident, named with the date and a brief description (e.g., #inc-2026-08-07-api-down). All incident-related communication happens in that channel. This keeps the noise out of your engineering channels and creates an automatic timeline you can reference in your post-mortem. The Incident Commander posts updates every 10 to 15 minutes, even if the update is just "still investigating, no new findings." This prevents stakeholders from interrupting engineers with status questions.
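The naming convention is easy to automate so every incident channel follows the same pattern. A sketch, assuming Slack's usual channel-name constraints (lowercase, hyphens, length cap):

```python
import re
from datetime import date

def incident_channel_name(day, description):
    """Build a Slack channel name like #inc-2026-08-07-api-down.

    Slack channel names are lowercase, use hyphens instead of spaces
    and punctuation, and have a length limit, so we slugify and cap.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"[:80]
```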

Invite your on-call engineer, their backup, the IC, the CTO or head of engineering, and anyone whose domain is likely affected. Do not invite the entire company to the channel during the incident, as it creates noise and slows down response.

External Communication: Your Status Page

Every production application needs a status page. Statuspage.io (from Atlassian) costs $29 to $99 per month and is the industry standard. Instatus is a cheaper alternative at $20 per month. If you are on Vercel, they have a built-in status page feature. The status page should be on a completely separate domain from your application so it stays up when your app is down.

Post an incident to your status page within 10 minutes of declaring a P1. Your first update does not need to have answers. It just needs to confirm that you are aware of the problem and actively investigating. Something like: "We are aware of an issue affecting [feature]. Our team is actively investigating. We will post an update in 30 minutes." This one post reduces inbound support tickets by 60 to 80% because users have somewhere to check that is not your support queue.

Customer-Facing Email for Major Incidents

For P1 incidents lasting more than 30 minutes, or incidents that involve data concerns, send a proactive email to affected customers. Do not wait for them to contact you. The email should state what happened, when it happened, what you did to fix it, and what you are doing to prevent recurrence. This email is often the difference between a churn event and a loyalty-building moment. Keep it simple, honest, and free of technical jargon.

Phase 3: Investigation and Resolution

With communication running and the Incident Commander keeping things organized, your engineers can focus on the actual problem. The key is a structured diagnostic checklist that prevents you from chasing theories randomly.

The First Question: What Changed?

The vast majority of production outages are caused by a recent change: a deployment, a configuration update, a database migration, or a third-party dependency update. Your first diagnostic step is always to check what changed in the 30 to 60 minutes before the incident started. Pull your recent deploy history. Check your feature flag configuration. Review recent database migrations. Look at infrastructure changes in your AWS console, Terraform state, or Kubernetes cluster. If you find a recent change that correlates with the start of the incident, roll it back first and ask questions second. Rolling back a bad deploy takes 3 minutes and resolves 40% of incidents immediately.
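The "what changed?" step can be a one-liner against whatever change log you keep. A sketch, assuming changes are dicts with an `at` timestamp and a `what` label (field names are placeholders):

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_start, lookback_minutes=60):
    """Return changes (deploys, migrations, flag flips) in the lookback
    window before the incident, newest first: your rollback candidates."""
    window_start = incident_start - timedelta(minutes=lookback_minutes)
    candidates = [c for c in changes
                  if window_start <= c["at"] <= incident_start]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)
```

The newest matching change is usually the first rollback to try.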

What Do the Logs Say?

If a rollback is not obvious or does not help, go to your logs. Structured logging with a tool like Datadog Logs, Papertrail, or AWS CloudWatch Logs Insights lets you query across millions of log lines in seconds. Search for ERROR and FATAL level messages in the window when the incident started. Look for new error patterns that did not exist before. Check for database query timeouts, external API failures, or out-of-memory errors. The log entry that starts the cascade is usually obvious once you are looking in the right time window.
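Locally, the same filtering a log-query tool performs looks like this. The sketch assumes structured log lines as dicts with `ts`, `level`, and `msg` fields, which are placeholders for your own schema.

```python
from datetime import datetime, timedelta

def errors_in_window(log_lines, incident_start, lookback_minutes=10):
    """Keep only ERROR/FATAL entries in a window around the incident start,
    which is where the log line that starts the cascade usually sits."""
    lo = incident_start - timedelta(minutes=lookback_minutes)
    hi = incident_start + timedelta(minutes=lookback_minutes)
    return [line for line in log_lines
            if line["level"] in ("ERROR", "FATAL") and lo <= line["ts"] <= hi]
```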

Common Root Causes

  • Bad deploys: A code change introduced a bug, a syntax error, or an unhandled edge case. Fix: rollback. Then fix the bug in a feature branch, test it, and redeploy with a proper code review.
  • Database issues: Connection pool exhaustion, a missing index on a new query pattern, a long-running migration locking tables, or hitting storage limits. Fix: kill long-running queries, increase connection pool size, add the missing index, or free disk space. These are some of the most common P1 causes in growing applications.
  • External dependency failures: Your payment processor, email provider, authentication service, or a third-party API is down. Fix: implement circuit breakers so your app degrades gracefully instead of failing hard. In the immediate term, disable the affected feature or show a maintenance message for that specific function.
  • Traffic spikes: A Product Hunt launch, a media mention, or a marketing campaign sent more traffic than your infrastructure can handle. Fix: enable auto-scaling rules if they are not already active, temporarily raise rate limits, and consider adding a CDN cache layer in front of your API for public endpoints. If you have load tested your app (which you should), you already know your breaking point and can plan capacity ahead of these events.

Confirming Resolution

Before declaring an incident resolved, verify recovery with data, not intuition. Check your synthetic uptime monitors for green status. Verify that error rates in your application logs have returned to baseline. Confirm that response times are back to normal in your APM tool. Ask a team member who was not involved in the fix to do a manual smoke test of the affected features. Only after all of these checks pass should the IC close the incident and post a resolution update to the status page.

Phase 4: The Post-Incident Review

The post-incident review (also called a post-mortem) is where the long-term value of incident response is created. A well-run post-mortem turns a painful outage into a permanently stronger system. A skipped or blame-focused post-mortem turns it into just another bad day that will probably happen again.

The Blameless Post-Mortem

The blameless post-mortem, popularized by Google's SRE book, is a foundational principle: the goal of a post-mortem is to understand system failures, not to assign blame to individuals. Human error is always a symptom of a system problem, not the root cause. If an engineer deployed a bad configuration, the system problem is that there was no automated validation of that configuration before deployment. Blamelessness is not about excusing poor work. It is about fixing the system so the same class of error cannot recur regardless of who is on duty.

Post-Mortem Template

Run your post-mortem within 3 to 5 business days of the incident, while details are still fresh. A solid post-mortem document covers the following:

  • Incident summary: One paragraph. What happened, when, and what was the user impact. Write it for a non-technical audience.
  • Timeline: A chronological log of events, pulled from your incident Slack channel and monitoring tools. Include when the incident started, when it was detected, when the IC was paged, when diagnosis began, when the fix was applied, and when the incident was resolved.
  • Root cause analysis: The actual technical cause. Be specific. "The deploy at 14:23 UTC introduced a database query that lacked an index on the user_id column. Under normal load this query took 80ms. After the deploy, it took 12 seconds, exhausting the PostgreSQL connection pool and causing cascading 503 errors across all API endpoints."
  • Contributing factors: Systemic issues that allowed the root cause to cause an outage. Examples: no automated query performance testing, insufficient staging environment load, no alert on connection pool saturation.
  • What went well: Be honest here. The on-call engineer responded in 4 minutes. The rollback was clean. The communication was clear. Recognizing what worked reinforces good practices.
  • Action items: Specific, assigned, time-bound tasks to prevent recurrence. Each action item needs an owner and a due date. Vague action items ("improve monitoring") never get done. Specific ones ("add PagerDuty alert when PostgreSQL connection pool utilization exceeds 80%, assigned to Sarah, due August 21") get done.

Building Your Incident Response Playbook

An incident response playbook is the document your on-call engineer reads at 2am before doing anything else. It should be short, scannable, and opinionated. This is not a general reference document. It is a decision tree for high-stress situations.

On-Call Schedule

Define who is on-call, when, and what their responsibilities are. Use PagerDuty or OpsGenie to manage the rotation. For a small team (3 to 6 engineers), a weekly rotation works well. Each engineer is primary on-call for one week at a time, with a backup rotation that lags by one week. Define explicitly what being on-call means: carry your phone, respond to P1 pages within 15 minutes, respond to P2 pages within 60 minutes. Pay your on-call engineers a stipend ($100 to $300 per week is typical for startups) to compensate them for the on-call burden and to make the schedule feel fair.
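The weekly rotation with a one-week-lagged backup is simple to compute deterministically. A sketch (the week-index arithmetic is a simplification that ignores ISO year boundaries):

```python
from datetime import date

def on_call_for_week(engineers, week_start):
    """Return (primary, backup) for the week containing week_start.

    Weekly rotation over the engineer list; the backup rotation lags
    primary by one week, so last week's primary is this week's backup.
    """
    week_index = week_start.isocalendar()[1] + week_start.year * 52
    primary = engineers[week_index % len(engineers)]
    backup = engineers[(week_index - 1) % len(engineers)]
    return primary, backup
```

In practice PagerDuty or OpsGenie owns this schedule; the point is that the backup always knows exactly when their lag week is.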

Runbooks for Common Failure Modes

A runbook is a step-by-step guide for resolving a specific known failure. Write runbooks for the failure modes you have already experienced and for the ones you can anticipate. A runbook for "PostgreSQL connection pool exhaustion" should include: how to identify the issue (what the alert looks like, what the logs say), immediate remediation steps (kill long-running queries, restart PgBouncer, temporarily scale down non-critical workers to reduce connection demand), and long-term fix steps (add index, tune pool size). Runbooks reduce mean time to resolution by 50 to 70% for known issues because the engineer does not have to figure out the diagnosis from scratch.

Store your runbooks in Notion, Confluence, or a dedicated runbooks directory in your GitHub repository. Link them directly from your PagerDuty alerts so the on-call engineer sees the runbook URL in their page notification. The goal is zero friction between "I got paged" and "I know what to do."

Communication Templates

Pre-write your communication templates so you are not drafting from scratch during an incident. You need four templates: initial incident acknowledgment for the status page, 30-minute update with "still investigating" language, resolution notice with root cause summary, and the customer email for major incidents. Store these templates somewhere your entire team can access them, not just your on-call engineer. Your head of customer success may need to post the status page update while your engineers are focused on the fix.
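The four templates can live as plain format strings anywhere your team can reach them. The wording below is illustrative, not prescriptive:

```python
# Pre-written incident communication templates; wording is illustrative.
TEMPLATES = {
    "ack": ("We are aware of an issue affecting {feature}. Our team is "
            "actively investigating. We will post an update by {next_update}."),
    "update": ("We are still investigating the issue affecting {feature}. "
               "Next update by {next_update}."),
    "resolved": ("The issue affecting {feature} has been resolved as of "
                 "{resolved_at}. Root cause: {root_cause}. A full "
                 "post-mortem will follow."),
    "customer_email": ("On {date}, {feature} was unavailable for {duration}. "
                       "Here is what happened, what we did, and what we are "
                       "changing to prevent recurrence: {details}"),
}

def render(name, **fields):
    """Fill in a template; raises KeyError if a placeholder is missing,
    so a half-filled update never goes out."""
    return TEMPLATES[name].format(**fields)
```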

Preventing Repeat Incidents

Post-mortems identify what to fix. Systematic engineering practices prevent entire categories of incidents from happening in the first place. Here are the four highest-leverage preventive investments for startups.

Pre-Deployment Checks

Most deploy-caused incidents are preventable with automated checks before code reaches production. At minimum, run automated tests (unit, integration, and a smoke test against a staging environment), a database migration dry-run that checks for locking operations on large tables, a dependency vulnerability scan (Snyk or GitHub Dependabot), and a brief manual smoke test by the deploying engineer. Use deployment windows: avoid deploying in the hour before or after peak traffic, on Fridays, or before holidays. It sounds obvious, but half of the Friday-night incidents we see are caused by a "quick fix" pushed at 4:30pm.
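A migration dry-run check can be as simple as a linter that flags known lock-heavy Postgres DDL. This is a naive regex sketch, not a complete safety analysis; the patterns cover only a few well-known cases.

```python
import re

# Naive patterns for Postgres DDL that takes heavy locks on large tables.
LOCKING_PATTERNS = [
    (r"\bCREATE\s+(UNIQUE\s+)?INDEX\b(?!.*\bCONCURRENTLY\b)",
     "CREATE INDEX without CONCURRENTLY blocks writes to the table"),
    (r"\bALTER\s+TABLE\b.*\bSET\s+NOT\s+NULL\b",
     "SET NOT NULL can scan the table under an exclusive lock"),
    (r"\bALTER\s+TABLE\b.*\bTYPE\b",
     "changing a column type can rewrite the table under an exclusive lock"),
]

def lint_migration(sql):
    """Return warnings for statements likely to lock large tables."""
    return [reason for pattern, reason in LOCKING_PATTERNS
            if re.search(pattern, sql, re.IGNORECASE)]
```

Wire this into CI so a migration that trips a pattern requires an explicit override rather than silently shipping.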

Feature Flags

Feature flags let you deploy code without activating it, then turn it on for 1% of users, then 10%, then 100%. If the 1% rollout causes errors, you flip the flag off and no customers are affected. LaunchDarkly is the enterprise standard at $75 to $200 per month for small teams. Unleash is a solid open-source alternative you can self-host. Even a simple environment variable-based feature flag system is better than none. Feature flags are the single best tool for reducing the risk of each individual deployment.
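Even the minimal environment-variable version mentioned above can do stable percentage rollouts. A sketch, where `FLAG_NEW_CHECKOUT=25` would enable the flag for a consistent 25% of users (the env-var naming convention is an assumption):

```python
import hashlib
import os

def flag_enabled(flag_name, user_id):
    """Environment-variable feature flag with percentage rollout.

    Hashing (flag, user) gives each user a stable bucket from 0-99, so
    the same users stay enabled as you raise the rollout percentage.
    """
    raw = os.environ.get(f"FLAG_{flag_name.upper()}", "0")
    try:
        percentage = int(raw)
    except ValueError:
        percentage = 0  # malformed value fails safe: flag off
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage
```

Flipping the flag off is just changing one environment variable, with no deploy required.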

Circuit Breakers

A circuit breaker is a pattern that stops calling an external dependency when it starts failing, preventing cascading failures. Without circuit breakers, if your payment API goes down, every user checkout attempt waits 30 seconds for a timeout, exhausting your server threads and taking down your entire application, not just checkouts. With a circuit breaker, after 5 consecutive failures the circuit opens, and your app immediately returns a clear error for payment attempts while the rest of your application continues normally. Implement circuit breakers for every external service call: payment processors, email providers, SMS providers, third-party APIs, and even internal microservices. Libraries like opossum (Node.js) and resilience4j (Java) make implementation straightforward.
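The pattern itself fits in a few dozen lines. This is a minimal sketch of the state machine (closed, open, half-open), not a replacement for a production library; the thresholds are placeholders, and the clock is injectable only so it can be tested.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, fail fast while open,
    then allow one trial call after a cooldown (half-open)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0       # any success fully closes the circuit
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error they can turn into a "payments are temporarily unavailable" message instead of a 30-second hang.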

Auto-Scaling with Headroom

Auto-scaling policies that only kick in when CPU hits 90% are too slow. By the time new instances provision (2 to 4 minutes on AWS), your existing instances have been overloaded long enough to cause timeouts and errors. Set your scale-out policy to trigger at 60 to 70% CPU utilization. Maintain a minimum instance count that can handle at least 150% of your typical peak traffic, so you have buffer before auto-scaling even needs to activate. The cost difference between running 2 instances at 40% CPU versus 1 instance at 80% is typically $50 to $150 per month. That is cheap insurance against traffic-spike outages.
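The scale-out arithmetic behind a target-tracking policy is straightforward: size the fleet so average CPU lands at the target. A sketch with the 60 to 70% target and minimum-count floor described above:

```python
import math

def desired_instances(current_instances, avg_cpu_pct,
                      target_cpu_pct=65.0, min_instances=2):
    """Instances needed so average CPU settles near the target.

    A 60-70% target leaves headroom for the 2-4 minutes new instances
    take to provision; min_instances is your always-on buffer.
    """
    needed = math.ceil(current_instances * avg_cpu_pct / target_cpu_pct)
    return max(needed, min_instances)
```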

Building reliable systems that handle outages gracefully is a core part of what we do at Kanopy Labs. If your team does not have incident response processes in place, or if you want to pressure-test your current setup, we can help. Book a free strategy call to talk through your reliability needs.


Tags: production outage, incident response, site reliability, on-call engineering, DevOps
