Why Webhooks Are Critical for SaaS Integration
Every SaaS product eventually needs to notify the outside world when something happens. A payment succeeds. A user upgrades. A document gets signed. A deploy finishes. The question is how you deliver that notification. You have two options: your customers poll your API on a schedule, or you push events to their endpoints in real time via webhooks.
Polling is wasteful, slow, and expensive. If a customer polls your API every 60 seconds for order updates, that is 1,440 requests per day per customer. Multiply by 500 customers, and you are fielding 720,000 API requests daily that return "no new data" 99% of the time. Your rate limits get hammered. Your infrastructure costs spike. And your customers still experience up to 60 seconds of latency between when an event happens and when they learn about it.
Webhooks flip this model. Your system pushes an HTTP POST to a registered URL the moment an event occurs. Latency drops from up to a minute to milliseconds. Your API load decreases dramatically. Your customers get data when it matters, not on some arbitrary polling interval. Stripe, GitHub, Shopify, Twilio, and most other major platforms use webhooks as a primary integration mechanism, for good reason.
But here is the catch: webhooks are inherently unreliable. You are making an HTTP request to a server you do not control. That server might be down for maintenance. It might be behind a load balancer that is returning 502 errors during a deploy. The network path between your infrastructure and theirs might drop packets. Their handler might time out because it is doing too much work synchronously. According to data from Svix and Hookdeck, roughly 15% of webhook deliveries fail on the first attempt across the industry. That number is not a bug. It is a fundamental property of distributed systems.
The difference between a production-grade webhook system and a fragile prototype is how you handle that 15%. The patterns in this guide will take you from "fire and forget" to a system that guarantees delivery, provides full observability, and gives your customers the confidence to build critical workflows on top of your events.
Reliability Patterns: Guaranteeing Delivery
The foundation of any serious webhook architecture is the delivery guarantee. You need to decide: do you guarantee at-most-once delivery (we try once, if it fails, the event is lost), at-least-once delivery (we keep retrying until we get a success response, but you might receive duplicates), or exactly-once delivery (each event arrives precisely once). In practice, exactly-once is impossible in distributed systems without cooperation from the consumer, so the industry standard is at-least-once delivery combined with idempotency keys.
At-Least-Once Delivery
At-least-once means your system will retry failed deliveries until the consumer acknowledges receipt with a 2xx HTTP response. The consumer might receive the same event more than once if their server processes the webhook but crashes before returning the response. This is an acceptable tradeoff because the consumer can deduplicate using the event ID you include in every payload.
To implement at-least-once delivery, you need persistent storage for pending events. Never fire webhooks directly from your application code. Instead, write the event to a durable queue (PostgreSQL, SQS, Redis Streams) and let a separate worker process handle delivery. This decouples event generation from delivery, so a downstream failure never blocks your main application flow.
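A minimal sketch of the "write to a durable queue first" idea, using SQLite as a stand-in for PostgreSQL so it runs anywhere. The table names, the order.paid event, and the choice to write the event in the same transaction as the state change are illustrative assumptions, not something the pattern strictly requires.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one business table plus an outbox of pending events.
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        """CREATE TABLE webhook_outbox (
               event_id   TEXT PRIMARY KEY,
               event_type TEXT NOT NULL,
               created_at TEXT NOT NULL,
               payload    TEXT NOT NULL,
               status     TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    conn.commit()


def mark_order_paid(conn: sqlite3.Connection, order_id: str) -> str:
    """Change state and enqueue the event together; no HTTP call happens here."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "status": "paid"})
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # one transaction: both writes commit, or neither does
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO webhook_outbox (event_id, event_type, created_at, payload) "
            "VALUES (?, ?, ?, ?)",
            (event_id, "order.paid", created_at, payload),
        )
    return event_id


def pending_events(conn: sqlite3.Connection, limit: int = 100) -> list:
    """What a separate delivery worker would poll for."""
    return conn.execute(
        "SELECT event_id, event_type, payload FROM webhook_outbox "
        "WHERE status = 'pending' ORDER BY created_at LIMIT ?",
        (limit,),
    ).fetchall()
```

Because the application only inserts a row, a consumer outage cannot slow down or fail the request that triggered the event.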
Idempotency Keys
Every webhook payload should include a unique event ID (sometimes called an idempotency key). The consumer stores these IDs and checks them before processing. If they have already seen an event with that ID, they skip it. A typical payload carries these fields:
- event_id: A UUID v4 generated at event creation time, not at delivery time. If you retry the same event, the ID stays the same.
- event_type: A dot-notation string like "invoice.paid" or "subscription.canceled" that tells the consumer which handler to invoke.
- created_at: ISO 8601 timestamp of when the event was generated, so consumers can detect and handle out-of-order delivery.
- data: The actual payload, containing the relevant resource state at the time the event was created.
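The fields above, and the consumer-side check they enable, can be sketched as follows. The build_event helper and the IdempotentConsumer class are hypothetical names, and the in-memory set stands in for whatever durable store (for example, a table with a unique index) a real consumer would use.

```python
import uuid
from datetime import datetime, timezone


def build_event(event_type: str, data: dict) -> dict:
    """Create the event once; retries reuse this dict, so event_id never changes."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }


class IdempotentConsumer:
    def __init__(self) -> None:
        self.seen_ids = set()   # assumption: a durable store in production
        self.processed = []

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        self.processed.append(event["event_type"])  # the real work goes here
        self.seen_ids.add(event["event_id"])
        return True
```

Delivering the same event twice leaves the consumer's state unchanged the second time, which is what makes at-least-once delivery safe to build on.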
Exponential Backoff with Jitter
When a delivery fails, you do not retry immediately. Hammering a struggling endpoint makes things worse. Instead, you use exponential backoff, retrying at growing offsets from the first attempt: 1 minute, then 5 minutes, then 30 minutes, then 2 hours, then 8 hours, then 24 hours. Each offset is roughly three to six times the previous one.
But pure exponential backoff creates thundering herd problems when multiple events fail simultaneously (common during a consumer's deploy window). All retries would fire at the same calculated time. Adding jitter solves this. For each retry, you compute the backoff interval and then add a random offset (typically 0 to 25% of the interval). This spreads retries across time and prevents burst traffic from overwhelming a recovering endpoint.
A typical retry schedule looks like: attempt 1 at T+0, attempt 2 at T+1m (with jitter), attempt 3 at T+5m, attempt 4 at T+30m, attempt 5 at T+2h, attempt 6 at T+8h, attempt 7 at T+24h. After the final attempt, the event moves to a dead letter queue for manual review or automated alerting. Most webhook providers give up after 24 to 72 hours of retries.
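As a rough sketch, the schedule above can be expressed as a lookup table plus a jitter term. The function name is hypothetical, and the offsets are treated as seconds after the first attempt (T), which is one reading of the schedule.

```python
import random
from typing import Optional

# Offsets from T for attempts 2 through 7: 1m, 5m, 30m, 2h, 8h, 24h.
RETRY_OFFSETS_SECONDS = [60, 300, 1800, 7200, 28800, 86400]
MAX_JITTER_FRACTION = 0.25  # add between 0% and 25% of the offset


def next_attempt_offset(failed_attempts: int,
                        rng: Optional[random.Random] = None) -> Optional[float]:
    """Seconds after T for the next attempt, or None when retries are exhausted.

    failed_attempts is how many attempts have already failed (1 to 7).
    None means the event should move to the dead letter queue.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(RETRY_OFFSETS_SECONDS):
        return None
    base = RETRY_OFFSETS_SECONDS[failed_attempts - 1]
    uniform = rng.uniform if rng is not None else random.uniform
    return base + uniform(0, MAX_JITTER_FRACTION * base)
```

Two events that fail at the same instant now get different retry times, so a recovering endpoint sees a spread of requests rather than a single burst.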
Security: Protecting the Webhook Pipeline
Webhooks introduce a unique security surface. You are sending potentially sensitive data to URLs that customers register. Those URLs point to servers you do not control. And on the receiving side, customers need to verify that incoming requests actually came from your system and were not spoofed by an attacker. Security is not optional here. A single leaked webhook payload could expose customer PII, financial data, or internal system state.
HMAC Signature Verification
The standard approach is HMAC-SHA256 signing. When a customer registers a webhook endpoint, you generate a shared secret (a random 32-byte string, base64 encoded). You store this secret, and the customer stores it on their end. For every webhook delivery, you compute an HMAC-SHA256 signature of the raw request body using this shared secret and include it in a header (typically X-Webhook-Signature or Webhook-Signature if following the Standard Webhooks spec).
The consumer recomputes the HMAC on their side using the same secret and the raw body they received. If the signatures match, the request is authentic. If they do not match, the request was tampered with or forged. This is the same mechanism Stripe, GitHub, and Shopify use. It is battle-tested and computationally cheap.
One critical implementation detail: always use a constant-time string comparison function when verifying signatures. A naive string comparison that short-circuits on the first mismatched character leaks timing information that attackers can exploit to forge valid signatures byte by byte.
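A minimal sketch of signing and verifying with Python's standard library. The hex encoding of the signature and the function names are assumptions for illustration; each provider defines its own header format.

```python
import base64
import hashlib
import hmac
import secrets


def generate_secret() -> str:
    """A random 32-byte secret, base64 encoded."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")


def sign(secret: str, raw_body: bytes) -> str:
    """HMAC-SHA256 over the raw request body, hex encoded (an assumed format)."""
    key = base64.b64decode(secret)
    return hmac.new(key, raw_body, hashlib.sha256).hexdigest()


def verify(secret: str, raw_body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(secret, raw_body)
    # compare_digest does not short-circuit on the first mismatched byte.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

Note that verification uses the raw bytes as received. Parsing the JSON and re-serializing it before verifying can change whitespace or key order and break the signature.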
Webhook Secret Rotation
Secrets get compromised. Employees leave. Logs accidentally capture headers. You need a rotation mechanism that does not cause downtime. The standard pattern is dual-secret support: when a customer rotates their secret, you keep both the old and new secret active for a grace period (typically 24 to 72 hours). During this window, you sign with the new secret and include a secondary signature with the old secret. The consumer can verify against either. After the grace period, you deactivate the old secret.
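One way to sketch dual-secret signing, assuming the header carries several space-separated hex signatures. That header layout and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Iterable


def _signature(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def signature_header(active_secrets: Iterable[bytes], body: bytes) -> str:
    """Sender side: one signature per active secret, new secret first."""
    return " ".join(_signature(secret, body) for secret in active_secrets)


def verify_any(consumer_secret: bytes, body: bytes, header: str) -> bool:
    """Consumer side: accept if any signature in the header matches our secret."""
    expected = _signature(consumer_secret, body).encode()
    return any(
        hmac.compare_digest(expected, candidate.encode())
        for candidate in header.split()
    )
```

During the grace period the sender passes both secrets to signature_header. Afterwards it passes only the new one, and consumers still holding the old secret start failing verification, which is the intended outcome.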
IP Allowlisting
For enterprise customers with strict network policies, publish a static list of IP addresses that your webhook delivery system uses. This lets customers configure their firewalls to accept traffic only from your known IPs. The downside: if you change infrastructure (migrate cloud providers, add regions), you need to coordinate IP changes with every customer who uses allowlisting. Publish your IPs in a well-known location (a JSON file at a stable URL, like /.well-known/webhooks/ips.json) and give customers advance notice before changes.
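On the customer side, checking a source address against the published list might look like the sketch below. The JSON shape (an "ipv4" array of CIDR blocks) is an assumed format, and the addresses come from documentation-only ranges.

```python
import ipaddress
import json

# Hypothetical contents of the published ips.json file.
PUBLISHED_IPS = '{"ipv4": ["192.0.2.10/32", "198.51.100.0/28"]}'


def load_allowlist(document: str) -> list:
    """Parse the published JSON into network objects."""
    return [ipaddress.ip_network(cidr) for cidr in json.loads(document)["ipv4"]]


def is_allowed(source_ip: str, networks: list) -> bool:
    """True if the request's source address falls inside any published range."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in networks)
```

Allowlisting complements signature verification rather than replacing it, since a source IP alone says nothing about whether the payload was altered.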
Payload Encryption
HMAC verification proves authenticity but does not encrypt the payload. The body travels in plaintext (over TLS, of course, but decrypted at every TLS termination point). For customers handling highly sensitive data, consider offering optional payload encryption using the customer's public key. They decrypt on their end with their private key. This adds complexity but provides end-to-end confidentiality even if TLS is terminated at a CDN or reverse proxy.
The Delivery Pipeline: Architecture Deep Dive
A production webhook delivery pipeline has several distinct stages. Each stage is independently scalable and observable. Skipping any of these stages leads to the kind of brittleness that wakes you up at 3 AM.
Event Generation and Queuing
Your application code generates events in response to state changes. A payment processor confirms a charge, so you emit a "payment.succeeded" event. A user updates their profile, so you emit a "user.updated" event. These events go into a durable queue, not directly to consumer endpoints. The queue serves as the system of record for what happened. PostgreSQL with a dedicated events table works for lower volumes (under 10,000 events per hour). For higher throughput, use Amazon SQS, Google Cloud Pub/Sub, or Redis Streams.
The event record should contain: the event ID, event type, timestamp, the full payload at the time of generation (not a reference to look up later, because the resource state might change), and a list of target endpoints that should receive this event based on customer subscriptions.
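A sketch of such a record, with the payload copied at generation time so later changes to the resource do not leak into the event. The subscriptions mapping (endpoint URL to subscribed event types) is a hypothetical structure.

```python
import copy
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EventRecord:
    event_id: str
    event_type: str
    created_at: str
    payload: dict            # full snapshot, not a reference to look up later
    target_endpoints: tuple  # resolved from subscriptions at generation time


def record_event(event_type: str, resource: dict, subscriptions: dict) -> EventRecord:
    targets = tuple(
        url for url, event_types in subscriptions.items() if event_type in event_types
    )
    return EventRecord(
        event_id=str(uuid.uuid4()),
        event_type=event_type,
        created_at=datetime.now(timezone.utc).isoformat(),
        payload=copy.deepcopy(resource),  # freeze the state as of right now
        target_endpoints=targets,
    )
```

If the user changes their profile again a second later, the first event still describes the state that triggered it.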
Worker Pool and Delivery
A pool of worker processes pulls events from the queue and attempts HTTP delivery. Each worker makes a POST request to the consumer's endpoint with a timeout of 5 to 30 seconds (configurable, but 15 seconds is a reasonable default). The worker sets appropriate headers: Content-Type, the HMAC signature, a unique delivery attempt ID, and the event timestamp.
Workers should be stateless and horizontally scalable. If you are processing 1,000 events per second and each delivery takes an average of 200ms, you need at least 200 deliveries in flight at once (1,000 events/s × 0.2 s), whether that means 200 single-threaded workers or fewer workers with higher concurrency each. In practice, use a worker framework that manages concurrency for you: BullMQ for Node.js, Celery for Python, or Sidekiq for Ruby. All three handle job queuing, concurrency limits, and retry scheduling out of the box.
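The sizing estimate is just arrival rate multiplied by time in flight. A small helper, with a hypothetical name, makes it easy to rerun with your own numbers:

```python
import math


def min_concurrent_deliveries(events_per_second: int, avg_delivery_ms: int) -> int:
    """Lower bound on simultaneous deliveries: rate times average duration.

    This ignores retries and traffic spikes, so treat it as a floor.
    """
    return math.ceil(events_per_second * avg_delivery_ms / 1000)
```

For the figures above, min_concurrent_deliveries(1000, 200) gives 200. Slow consumers raise the average duration quickly, which is one reason to keep the timeout bounded.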
Retry Logic and State Machine
Each delivery attempt transitions through states: pending, in-flight, succeeded, failed-will-retry, and failed-permanently. After a failed delivery (non-2xx response or timeout), the worker schedules a retry based on the backoff schedule and moves the event to "failed-will-retry." After exhausting all retry attempts, it moves to "failed-permanently" and lands in the dead letter queue.
Track the HTTP status code and response body from each attempt. This data is invaluable for debugging. If a consumer returns a 401, their auth configuration is wrong. If they return a 500, their handler is crashing. If the request times out, they are doing too much work in the handler. Surfacing this information in your webhook management UI helps customers self-serve instead of filing support tickets.
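The transitions and the debugging hints can be sketched as two small functions. The limit of seven attempts is an assumption carried over from the retry schedule, and a status_code of None stands for a timeout.

```python
from enum import Enum
from typing import Optional


class DeliveryState(Enum):
    PENDING = "pending"
    IN_FLIGHT = "in-flight"
    SUCCEEDED = "succeeded"
    FAILED_WILL_RETRY = "failed-will-retry"
    FAILED_PERMANENTLY = "failed-permanently"


MAX_ATTEMPTS = 7  # assumption: matches a seven-attempt retry schedule


def state_after_attempt(status_code: Optional[int], attempts_made: int) -> DeliveryState:
    """Decide the next state once an in-flight attempt finishes."""
    if status_code is not None and 200 <= status_code < 300:
        return DeliveryState.SUCCEEDED
    if attempts_made >= MAX_ATTEMPTS:
        return DeliveryState.FAILED_PERMANENTLY  # goes to the dead letter queue
    return DeliveryState.FAILED_WILL_RETRY


def diagnose(status_code: Optional[int]) -> str:
    """A hint to show next to the attempt in the management UI."""
    if status_code is None:
        return "timeout: the handler may be doing too much work synchronously"
    if status_code == 401:
        return "401: check the endpoint's auth configuration"
    if status_code >= 500:
        return "5xx: the handler is crashing or the server is unavailable"
    return "non-2xx response: " + str(status_code)
```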
Dead Letter Queue
Events that exhaust all retry attempts need somewhere to go. The dead letter queue (DLQ) is a holding area for permanently failed deliveries. It should be queryable by customer, endpoint, event type, and time range. Your operations team monitors the DLQ for patterns. If a single customer's endpoint accumulates hundreds of dead letters, something is fundamentally broken on their end, and your system should proactively disable that endpoint and notify the customer rather than continuing to burn compute on deliveries that will never succeed.
Customers should also be able to replay events from the DLQ through your webhook management UI. Once they fix their endpoint, they click "replay all" and every failed event gets re-delivered in order. This is one of the most requested features by enterprise customers and a strong differentiator if you are building webhook infrastructure.
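A simplified sketch of a queryable DLQ with "replay all". The list-backed queue and the deliver callback are stand-ins for real storage and the real delivery worker, and timestamps are assumed to be ISO 8601 strings in UTC so that string order matches time order.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DeadLetter:
    event_id: str
    customer_id: str
    endpoint: str
    event_type: str
    created_at: str  # ISO 8601 in UTC


def query(dlq: List[DeadLetter], customer_id: Optional[str] = None,
          endpoint: Optional[str] = None, event_type: Optional[str] = None) -> List[DeadLetter]:
    """Filter dead letters; a None argument means 'do not filter on this field'."""
    return [
        d for d in dlq
        if (customer_id is None or d.customer_id == customer_id)
        and (endpoint is None or d.endpoint == endpoint)
        and (event_type is None or d.event_type == event_type)
    ]


def replay_all(dlq: List[DeadLetter], customer_id: str,
               deliver: Callable[[DeadLetter], bool]) -> List[str]:
    """Re-deliver a customer's dead letters oldest first; drop the ones that succeed."""
    replayed = []
    for letter in sorted(query(dlq, customer_id=customer_id), key=lambda d: d.created_at):
        if deliver(letter):
            dlq.remove(letter)
            replayed.append(letter.event_id)
    return replayed
```

Events that fail again stay in the queue, so a second replay after another fix picks up where the first one left off.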
Monitoring, Alerting, and Observability
You cannot operate what you cannot observe. A webhook delivery system without proper monitoring is a black box that silently drops events until a customer notices and files an angry support ticket. Here are the metrics and alerting patterns that keep you ahead of problems.
Delivery Success Rates
Track the first-attempt success rate and the overall success rate (including retries) as separate metrics. Expect a first-attempt rate of around 85%, in line with the industry-wide failure figure cited earlier. Your overall rate (after retries) should be above 99.5%. If it drops below that, you have a systemic issue: either your retry logic is insufficient, or a large customer's endpoint is persistently failing and dragging down the aggregate.
Break these metrics down by customer, endpoint, and event type. A global success rate of 99% might hide the fact that one specific customer's endpoint is failing 80% of the time. Per-endpoint dashboards catch this immediately.
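A sketch of computing both rates per endpoint from an attempt log. The tuple layout (endpoint, event_id, attempt_number, ok) is a hypothetical log format, and the code assumes every event has its first attempt recorded.

```python
from collections import defaultdict


def success_rates(attempts) -> dict:
    """Per-endpoint first-attempt and overall (after retries) success rates."""
    first_ok = defaultdict(dict)   # endpoint -> event_id -> did attempt 1 succeed
    delivered = defaultdict(dict)  # endpoint -> event_id -> did any attempt succeed
    for endpoint, event_id, attempt_number, ok in attempts:
        if attempt_number == 1:
            first_ok[endpoint][event_id] = ok
        delivered[endpoint][event_id] = delivered[endpoint].get(event_id, False) or ok
    report = {}
    for endpoint, events in delivered.items():
        total = len(events)
        report[endpoint] = {
            "first_attempt": sum(first_ok[endpoint].values()) / total,
            "overall": sum(events.values()) / total,
        }
    return report
```

Grouping by endpoint is what exposes a single failing customer that a global average would smooth over.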
Latency Percentiles
Track p50, p95, and p99 delivery latency (time from event generation to successful delivery acknowledgment). Your p50 should be under 2 seconds. Your p95 should be under 10 seconds. Your p99 will be higher because it includes events that required one or more retries. Set alerts when p95 crosses 30 seconds, because that usually indicates a consumer endpoint is consistently slow and you may need to adjust timeouts or reach out.
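For illustration, percentiles can be computed from raw latency samples with the standard library. In production you would more likely read them from your metrics system; the function names here are hypothetical.

```python
import statistics


def latency_report(latencies_seconds: list) -> dict:
    """p50, p95, and p99 from raw samples (needs at least two data points)."""
    cuts = statistics.quantiles(latencies_seconds, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def p95_alert(report: dict, threshold_seconds: float = 30.0) -> bool:
    """True when p95 crosses the alert threshold."""
    return report["p95"] > threshold_seconds
```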
Failed Endpoint Detection
Implement a circuit breaker pattern for consumer endpoints. If an endpoint fails 5 consecutive deliveries (or drops below a 50% success rate over 100 attempts), mark it as "unhealthy." Unhealthy endpoints get a reduced retry cadence to avoid wasting resources. After a cooldown period, send a health check probe. If the probe succeeds, resume normal delivery. If it fails, extend the cooldown.
Critically, notify the customer when their endpoint enters an unhealthy state. Send an email, trigger an in-app notification, or (ironically) send a webhook to a secondary "alert" endpoint they register specifically for operational notifications. Svix calls these "operational webhooks" and they are one of the most valuable features in their platform.
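A compact sketch of the consecutive-failure variant of this circuit breaker. The 300-second starting cooldown and the doubling on a failed probe are assumptions, and the rolling success-rate trigger is left out for brevity.

```python
class EndpointBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 300.0):
        self.failure_threshold = failure_threshold
        self.base_cooldown = cooldown_seconds
        self.cooldown = cooldown_seconds
        self.consecutive_failures = 0
        self.healthy = True
        self.probe_at = None  # when to send the next health check probe

    def record_delivery(self, ok: bool, now: float) -> bool:
        """Return True when the endpoint has just turned unhealthy (notify the customer)."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.healthy and self.consecutive_failures >= self.failure_threshold:
            self.healthy = False
            self.probe_at = now + self.cooldown
            return True
        return False

    def record_probe(self, ok: bool, now: float) -> None:
        """A successful probe resumes normal delivery; a failed one extends the cooldown."""
        if ok:
            self.healthy = True
            self.consecutive_failures = 0
            self.cooldown = self.base_cooldown
            self.probe_at = None
        else:
            self.cooldown *= 2
            self.probe_at = now + self.cooldown
```

The True return value from record_delivery fires exactly once per outage, which keeps the customer notification from repeating on every subsequent failure.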
Alerting for Your Operations Team
Set up alerts for: DLQ depth exceeding a threshold (more than 100 events stuck), global success rate dropping below 95%, any single endpoint accumulating more than 50 consecutive failures, and delivery latency p95 exceeding 60 seconds. Route these to PagerDuty or Opsgenie with appropriate severity levels. DLQ depth and global success rate are page-worthy. Individual endpoint failures are informational unless they affect a high-value customer.
Also monitor your own infrastructure: queue depth (are events backing up faster than workers can process them?), worker error rates (are workers crashing?), and memory usage on delivery workers (large payloads can cause OOM kills). These operational metrics catch problems before they manifest as customer-facing failures.
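The alert thresholds can be captured as a small rule table. The metric names are hypothetical, and the "warn" severity for latency is an assumption, since only DLQ depth and global success rate are described as page-worthy.

```python
def evaluate_alerts(metrics: dict) -> list:
    """Return (alert_name, severity) pairs for every threshold that is breached."""
    alerts = []
    if metrics["dlq_depth"] > 100:
        alerts.append(("dlq_depth", "page"))
    if metrics["global_success_rate"] < 0.95:
        alerts.append(("global_success_rate", "page"))
    if metrics["max_consecutive_endpoint_failures"] > 50:
        alerts.append(("endpoint_failures", "info"))
    if metrics["p95_latency_seconds"] > 60:
        alerts.append(("p95_latency", "warn"))  # assumed severity
    return alerts
```

In practice these rules usually live in your monitoring system's configuration rather than in application code, but keeping them in one place makes the thresholds easy to review.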
Webhook Management UI for Customers
A webhook system is only as good as the self-service experience you provide to your customers. If they need to contact support every time they want to change an endpoint URL, view delivery history, or replay a failed event, you have built half a system. The management UI is what separates "we support webhooks" from "webhooks are a first-class feature of our platform."
Endpoint Registration
Customers need to register one or more endpoint URLs where they want to receive events. The registration flow should include: URL validation (check that it is a valid HTTPS URL), a verification step (send a test event and confirm receipt), and the ability to label endpoints for easy identification. Many customers run separate endpoints for staging and production environments, so support multiple endpoints per customer.
During registration, generate and display the webhook secret exactly once. Show it in a copyable field and warn the customer that they will not be able to see it again. Provide a "rotate secret" button that initiates the dual-secret rotation process described earlier.
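A sketch of the registration step, covering URL validation and secret generation only. The function names and the returned dictionary are hypothetical, and the test-event verification step is omitted.

```python
import base64
import secrets
from urllib.parse import urlparse


def is_valid_endpoint_url(url: str) -> bool:
    """Accept only well-formed HTTPS URLs that include a host."""
    try:
        parsed = urlparse(url)
        return parsed.scheme == "https" and bool(parsed.hostname)
    except ValueError:
        return False


def register_endpoint(url: str, label: str) -> dict:
    """Create an endpoint record; the secret is shown to the customer once."""
    if not is_valid_endpoint_url(url):
        raise ValueError("endpoint must be a valid HTTPS URL")
    secret = base64.b64encode(secrets.token_bytes(32)).decode("ascii")
    return {"url": url, "label": label, "secret": secret}
```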
Event Filtering
Not every customer wants every event type. A payment processor integration might only care about "payment.succeeded" and "payment.failed," ignoring "customer.updated" or "subscription.trial_ending." Let customers select which event types each endpoint receives. This reduces noise, saves bandwidth, and makes their handler code simpler because they do not need to filter on their end.
Advanced filtering goes further: let customers filter by resource attributes. For example, "only send invoice.paid events where the amount is greater than $1,000." This level of filtering is complex to implement but dramatically reduces unnecessary webhook traffic for customers with high event volumes. If you are building this from scratch, consider using a simple expression language like JSONPath or a subset of MongoDB query syntax.
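Both levels of filtering can be sketched with a tiny subset of MongoDB-style comparison operators. The subscription shape ("event_types" plus an optional "where" clause) is a hypothetical format, and only flat fields are supported.

```python
import operator

# A deliberately small operator subset, borrowed from MongoDB query syntax.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
    "$ne": operator.ne,
}


def matches(data: dict, conditions: dict) -> bool:
    """True if every condition holds; a missing field never matches."""
    for field, checks in conditions.items():
        if field not in data:
            return False
        for op_name, operand in checks.items():
            if not OPERATORS[op_name](data[field], operand):
                return False
    return True


def should_deliver(event: dict, subscription: dict) -> bool:
    """Check the event type first, then any attribute conditions."""
    if event["event_type"] not in subscription["event_types"]:
        return False
    return matches(event["data"], subscription.get("where", {}))
```

The "only invoice.paid over $1,000" example becomes a subscription with event_types {"invoice.paid"} and where {"amount": {"$gt": 1000}}.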
Delivery Logs and Manual Replay
Every delivery attempt should be logged and visible to the customer: timestamp, event type, HTTP status code returned, response body (truncated), and duration. This lets customers debug integration issues without contacting your support team. If their endpoint returned a 500, they can inspect the error, fix their handler, and replay the event.
Manual replay is the killer feature. A customer deploys a buggy handler, it crashes on 50 events, they fix the bug, and they hit "replay all failed events in the last 24 hours." The events get re-delivered in chronological order. Without this feature, those 50 events are simply lost, and the customer has to build reconciliation logic to recover the missed data from your API. Replay turns a potential data loss incident into a minor inconvenience.
The event-driven architecture patterns that power these replay systems are well established across message queue technologies, so you can lean on existing guidance when implementing them.
Build vs Buy: The Real Cost Calculus
Every engineering team asks the same question: should we build webhook infrastructure in-house or use a managed service? The answer depends on your event volume, your team's operational capacity, and how central webhooks are to your product's value proposition. Here is an honest breakdown.
Building Custom with BullMQ or SQS
If you are on Node.js, BullMQ gives you a Redis-backed job queue with built-in retry logic, exponential backoff, rate limiting, and concurrency controls. You can build a basic webhook delivery system in 2 to 3 weeks of engineering time. Add another 2 weeks for the customer-facing management UI. Total: roughly 4 to 5 weeks of a senior engineer's time, plus ongoing maintenance.
The hidden costs emerge after launch. You need monitoring dashboards, alerting rules, the dead letter queue UI, signature verification libraries for your customers, documentation for the webhook format, and ongoing operational burden. Every time a customer reports a missed webhook, someone on your team investigates. Budget 10 to 20% of one engineer's time for ongoing maintenance at moderate scale (100K events per month). At higher scale, it becomes a larger tax.
Using AWS SQS or Google Cloud Tasks as the backing queue instead of Redis gives you better durability guarantees and less operational surface area, but the application layer work (delivery logic, retry scheduling, customer UI) remains the same.
Svix: $500/mo for Managed Sending
Svix's cloud offering starts with a free tier of 50,000 messages per month. The Business plan at $500/month gives you 500,000 messages, priority support, custom branding for the embeddable portal, and advanced features like transformations and operational webhooks. Enterprise pricing is custom for higher volumes.
What you get: battle-tested delivery infrastructure, a pre-built customer management portal, signature verification libraries in every major language, and an operations team that is not yours. What you give up: full control over the delivery pipeline and the ability to customize retry behavior beyond what Svix exposes. For most SaaS products, Svix is the right choice. You eliminate 4 to 6 weeks of upfront build time and an ongoing operational burden that compounds as you scale.
Hookdeck: Ingestion and Transformation
Hookdeck occupies a different niche. It sits between webhook senders and your application, handling ingestion, queuing, transformation, and reliable delivery to your internal systems. If you are primarily a webhook consumer (receiving webhooks from Stripe, GitHub, Shopify, etc.), Hookdeck normalizes the chaos of different providers into a consistent, reliable pipeline. Pricing starts at $0 for 5,000 events per month, with paid plans from $25/month scaling based on volume.
Convoy: Open-Source Gateway
Convoy is fully open-source and self-hosted. You run it on your own infrastructure, which means zero per-message costs but full operational responsibility. It handles retries, rate limiting, circuit breaking, and provides a management dashboard. Convoy is the right choice if you have strict data sovereignty requirements, want full control, and have the DevOps capacity to run another stateful service. Compare this against the API-first development approach where you architect your integration layer for extensibility from day one.
Decision Framework
- Under 50K events/month, small team: Use Svix free tier or Hookdeck free tier. Do not build custom infrastructure at this scale.
- 50K to 500K events/month, growing SaaS: Svix Business at $500/month or Hookdeck paid plan. The cost is less than a single day of engineer time per month.
- Over 500K events/month, enterprise requirements: Evaluate Svix Enterprise, Convoy self-hosted, or custom build. At this scale, the operational knowledge exists on your team to justify custom infrastructure if needed.
- Data sovereignty or air-gapped environments: Convoy self-hosted or Svix self-hosted (open-source Rust server). A hosted service cannot run inside an air-gapped network, and it may not satisfy strict data-residency requirements.
Standard Webhooks Specification and What Comes Next
The webhook ecosystem has historically been fragmented. Every provider implements webhooks slightly differently: different header names for signatures, different retry schedules, different payload formats. The Standard Webhooks specification (standardwebhooks.com) aims to fix this by defining a common set of conventions that webhook senders and consumers can follow.
What the Spec Defines
Standard Webhooks specifies three headers: webhook-id, containing a unique message identifier; webhook-timestamp, containing a Unix timestamp (in seconds) for replay attack prevention; and webhook-signature, containing one or more HMAC-SHA256 or ed25519 signatures. The spec also defines the signing algorithm (for the symmetric scheme, HMAC-SHA256 over the message ID, timestamp, and body joined with periods), the verification process, and the tolerance window for timestamp validation (typically 5 minutes).
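A sketch of the symmetric scheme as commonly implemented: the signed content is the message ID, timestamp, and body joined with periods, and the signature is base64 encoded with a "v1," prefix. Treat these details as a reading of the spec to check against standardwebhooks.com, and prefer the official verification libraries in production.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 5 * 60


def sign(secret: bytes, msg_id: str, timestamp: int, body: str) -> str:
    """Sign 'msg_id.timestamp.body' with HMAC-SHA256, base64 encoded."""
    signed_content = f"{msg_id}.{timestamp}.{body}".encode()
    digest = hmac.new(secret, signed_content, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode("ascii")


def verify(secret: bytes, headers: dict, body: str, now: Optional[float] = None) -> bool:
    """Reject stale timestamps first, then accept any matching signature."""
    now = time.time() if now is None else now
    try:
        timestamp = int(headers["webhook-timestamp"])
    except (KeyError, ValueError):
        return False
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, headers.get("webhook-id", ""), timestamp, body).encode()
    candidates = headers.get("webhook-signature", "").split()
    return any(hmac.compare_digest(expected, c.encode()) for c in candidates)
```

Including the timestamp in the signed content is what makes the tolerance window meaningful, because an attacker cannot refresh the timestamp on a captured request without invalidating the signature.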
Adopting Standard Webhooks means your customers can use generic webhook verification libraries instead of writing custom code for your specific implementation. It also means tools like Svix, Hookdeck, and generic webhook testing services (like webhook.site or RequestBin) work with your webhooks out of the box.
Implementation Checklist
If you are building a webhook system today, here is the minimum viable implementation that follows industry best practices:
- Follow Standard Webhooks spec for header names and signature format. This costs you nothing and makes your webhooks compatible with existing tooling.
- Include event_id, event_type, and created_at in every payload for idempotency and debugging.
- Implement exponential backoff with jitter over a 24-hour retry window. Seven attempts with increasing intervals is the standard.
- Provide signature verification libraries in at least JavaScript, Python, and Go. Or point customers to the Standard Webhooks verification libraries.
- Build a delivery log visible to customers showing attempt history, status codes, and response bodies.
- Implement replay functionality so customers can recover from handler bugs without data loss.
- Set up circuit breakers that disable persistently failing endpoints and notify customers proactively.
- Publish your IP addresses for enterprise customers who need allowlisting.
The Future: Webhook Subscriptions and CloudEvents
Two related specifications are gaining traction. The WebSub protocol (W3C) defines a subscription mechanism where consumers register interest in specific topics through a hub. CloudEvents (CNCF) standardizes the envelope format for event data across different transport protocols (HTTP, AMQP, Kafka, MQTT). Both aim to reduce the integration burden, but neither has achieved the critical mass needed to replace custom webhook implementations in most SaaS products.
For now, the pragmatic path is: follow Standard Webhooks for signing and headers, use a simple JSON payload format with clear event types, and focus your energy on reliability and observability rather than specification compliance beyond what is proven in production.
If you are building a SaaS product and need reliable webhook infrastructure without the 4 to 6 week engineering investment, we can help you architect the right solution. Book a free strategy call and we will map out the webhook architecture that fits your scale, compliance needs, and budget.