Why Webhook Delivery Is Harder Than It Looks
Every SaaS founder has the same experience. You add a webhook feature to your product in a couple of days. A simple HTTP POST when something happens. It works great in development and staging. Then you ship it, a few hundred customers subscribe to events, and everything starts to crack. Endpoints go down. Responses time out. Payloads get delivered twice. Customers complain that they missed critical events, and you have no visibility into what went wrong.
The core problem is that webhooks are a distributed systems problem disguised as a simple HTTP call. You are making a contract with your customers: when X happens in our system, we will notify yours. That contract implies reliability, ordering, and consistency. Delivering on those guarantees requires careful architecture. Most teams underestimate this until they are dealing with angry enterprise customers who lost data because a webhook was silently dropped.
We have built webhook delivery systems for multiple SaaS products, and the pattern is always the same. Version one is naive (fire and forget). Version two adds retries. Version three adds logging and monitoring. Version four is a complete rewrite because the first three versions created a mess of edge cases. This guide skips straight to version four. You will get the architecture decisions, trade-offs, and implementation patterns that hold up under real production load.
If you are building a SaaS product that other developers integrate with, webhook reliability is not optional. It is a competitive differentiator. Stripe, GitHub, and Shopify all invest heavily in their webhook infrastructure because they know that flaky delivery erodes developer trust. Your product should do the same.
Delivery Guarantees: At-Least-Once vs Exactly-Once
Before writing any code, you need to decide on your delivery guarantee. This is the single most important architectural decision for your webhook system, and it cascades into every other choice you will make.
At-least-once delivery means your system guarantees that every event will be delivered to the subscriber at least one time, but it might be delivered more than once. This is the standard for webhook systems and is what Stripe, GitHub, Twilio, and nearly every major API provider offers. It is achievable, testable, and well-understood. You retry on failure until you get a success response or exhaust your retry budget.
Exactly-once delivery is the holy grail that does not exist in distributed systems without cooperation from both sides. You cannot guarantee that a remote server received and processed your payload exactly once. The network might drop the response after the server processed the request. The server might crash after writing to its database but before sending the 200 response. In both cases, your system sees a failure and retries, resulting in a duplicate delivery. True exactly-once semantics require the consumer to implement idempotency on their end.
The practical approach is to build at-least-once delivery on your side and provide your consumers with the tools they need to achieve exactly-once processing. That means including an idempotency key (a unique event ID) in every webhook payload. Your documentation should clearly state: "We guarantee at-least-once delivery. Use the event_id field to deduplicate on your end." This is honest, achievable, and matches industry standards.
Here is what the payload structure should look like:
- event_id: A globally unique identifier (UUIDv7 works well because it is time-sortable). Consumers use this to deduplicate.
- event_type: A namespaced string like invoice.paid or subscription.cancelled. Use past tense for events that already happened.
- created_at: ISO 8601 timestamp of when the event was created in your system.
- data: The actual payload object with the relevant resource state.
- api_version: The webhook schema version, so consumers can handle payload changes gracefully.
This structure gives consumers everything they need to process events reliably. The event_id handles deduplication, the timestamp enables ordering checks, and the version field future-proofs the integration against schema changes.
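As a rough illustration of that envelope, here is a minimal Python sketch. The build_event helper, the sample invoice fields, and the "2024-01-01" version string are all hypothetical. It uses uuid4 as a stand-in for the UUIDv7 recommended above, because UUIDv7 generation is not available in older Python standard libraries.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(event_type: str, data: dict, api_version: str = "2024-01-01") -> dict:
    """Wrap a resource snapshot in the five-field envelope described above."""
    return {
        # Stand-in ID: swap in a UUIDv7 generator if you want time-sortable IDs.
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        # Timezone-aware ISO 8601 timestamp in UTC.
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
        "api_version": api_version,
    }


event = build_event("invoice.paid", {"invoice_id": "inv_123", "amount_cents": 4200})

# Serialize once and reuse these exact bytes for signing and for every retry,
# so the event_id and body stay stable across delivery attempts.
body = json.dumps(event, separators=(",", ":"), sort_keys=True)
```

Generating the envelope once, at event creation time, is what keeps the event_id identical on every retry.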
Retry Strategies and Dead Letter Queues
Your retry strategy is the backbone of reliable delivery. Get it wrong and you will either hammer failing endpoints into oblivion or give up too early and drop events. The industry standard is exponential backoff with jitter, and there are good reasons it became the standard.
Exponential backoff spaces out retries with increasing delays. A typical schedule looks like this: first retry after 30 seconds, then 2 minutes, then 10 minutes, then 1 hour, then 4 hours, then 12 hours. This gives transient failures time to resolve while not abandoning the delivery. Most temporary outages resolve within the first few retries. Infrastructure issues that take down an endpoint for more than 12 hours are rare, but they happen, so you want a long tail on your retry schedule.
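Jitter is critical and often overlooked. Without jitter, if an endpoint goes down and 500 webhooks queue up, all 500 will retry at the exact same time. This creates a thundering herd that can overwhelm the recovering endpoint and cause it to fail again. Adding randomized jitter (plus or minus 20% of the backoff interval) spreads the retries across a window, preventing this pile-on effect. For a pure doubling schedule the formula is simple: delay = base_delay * 2^attempt * (0.8 + random() * 0.4), where random() returns a uniform value between 0 and 1. If you use a hand-tuned schedule like the one above, which grows faster than doubling, multiply each scheduled interval by the same (0.8 + random() * 0.4) factor.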
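The jitter formula translates almost directly into code. The sketch below is one way to write it in Python. The function name, the 30-second base, and the 12-hour cap are our assumptions for illustration. The rng parameter exists only so the jitter can be pinned in tests.

```python
import random
from typing import Callable


def retry_delay(
    attempt: int,
    base_delay: float = 30.0,
    max_delay: float = 12 * 3600.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Implements delay = base_delay * 2^attempt * (0.8 + random() * 0.4),
    with the un-jittered backoff capped at max_delay (the cap is an
    assumption of this sketch, not part of the formula).
    """
    backoff = min(base_delay * (2 ** attempt), max_delay)
    jitter_factor = 0.8 + rng() * 0.4  # uniform in [0.8, 1.2)
    return backoff * jitter_factor


# Example: print a jittered schedule for the first six retries.
schedule = [round(retry_delay(n), 1) for n in range(6)]
```

With the defaults, attempt 0 lands somewhere between 24 and 36 seconds, attempt 1 between 48 and 72 seconds, and so on.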
You also need to classify response codes to determine retry behavior:
- 2xx responses: Delivery succeeded. Mark as delivered and move on.
- 4xx responses (except 429): The consumer rejected the payload. Do not retry, as the consumer is telling you the request is invalid. Log it and alert the customer.
- 429 (Too Many Requests): Respect the Retry-After header if present. Otherwise, back off aggressively.
- 5xx responses: The consumer is having problems. Retry with your backoff schedule.
- Timeouts: Treat as a 5xx. Set your timeout at 30 seconds, which is generous for a webhook receiver.
- Connection errors: DNS failures, connection refused, TLS errors. Retry with backoff, but if TLS errors persist, alert the customer that their endpoint certificate may be misconfigured.
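The classification rules above fit in one small function. This is a sketch, not a complete policy: the Action enum and classify function are names we made up, a status of None stands for a timeout or connection error, and only the delay-in-seconds form of Retry-After is parsed (the header may also carry an HTTP date, which this ignores).

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    FAIL = "fail"  # do not retry; log and alert the customer


def classify(status: Optional[int], retry_after: Optional[str] = None) -> Tuple[Action, Optional[float]]:
    """Map a delivery outcome to (action, explicit delay in seconds or None).

    A delay of None means "use the normal backoff schedule".
    """
    if status is None:
        # Timeout, DNS failure, connection refused, TLS error: treat like a 5xx.
        return Action.RETRY, None
    if 200 <= status < 300:
        return Action.DELIVERED, None
    if status == 429:
        header = (retry_after or "").strip()
        # Only the integer-seconds form of Retry-After is handled here.
        delay = float(header) if header.isdigit() else None
        return Action.RETRY, delay
    if 400 <= status < 500:
        return Action.FAIL, None
    # 5xx, plus anything the list above does not cover (for example 3xx),
    # is retried in this sketch; that fallback is our choice.
    return Action.RETRY, None
```

Keeping this logic in one pure function makes it easy to unit test every status code you care about before wiring it into a worker.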
Dead letter queues (DLQs) catch events that exhaust their retry budget. After your final retry attempt, the event goes into a DLQ rather than being silently dropped. This is non-negotiable. Dropping events with no record is the worst possible outcome. Your DLQ should store the full event payload, all delivery attempt timestamps and responses, the endpoint URL, and any error details. Give customers a UI to view their DLQ and manually replay failed events. Svix and Hookdeck both provide this out of the box, which is one reason to consider a managed solution.
One pattern we have found valuable: after three consecutive failures to an endpoint, automatically disable the subscription and send the customer an email notification. This prevents your system from wasting resources retrying against an endpoint that was decommissioned. The customer can re-enable the subscription and replay missed events from the DLQ once their endpoint is healthy.
Payload Signing, Security, and Idempotency Keys
If you are sending webhooks without signing the payloads, your customers have no way to verify the requests actually came from your system. Anyone who discovers the endpoint URL can forge webhook payloads. For SaaS products handling financial data, user information, or any sensitive operations, this is a serious security gap.
HMAC-SHA256 signing is the standard approach. When a customer registers a webhook endpoint, you generate a shared secret (at least 32 bytes of cryptographically random data). For every webhook delivery, you compute an HMAC-SHA256 signature of the raw request body using that secret and include it in a header. The customer verifies the signature on their end before processing the payload. If the signature does not match, the payload was tampered with or forged.
Your signing implementation should follow these patterns:
- Include a timestamp in the signed content (not just the body). Sign timestamp.body to prevent replay attacks. The consumer checks that the timestamp is within a 5-minute tolerance window.
- Use a dedicated header like X-Webhook-Signature for the signature and X-Webhook-Timestamp for the timestamp.
- Support secret rotation by allowing two active secrets simultaneously. When a customer rotates their secret, both the old and new secret are valid for a transition period (typically 24 to 48 hours).
- Provide verification libraries or code examples in your customers' most common languages. Stripe does this exceptionally well with their stripe.webhooks.constructEvent() helper.
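To make the signing patterns concrete, here is a minimal sketch using only Python's standard library. The function names, the hex encoding of the signature, and the way the timestamp is joined to the body with a dot are our assumptions. Real providers differ in these details, so treat this as an illustration rather than a wire-compatible implementation.

```python
import hashlib
import hmac
import time
from typing import Iterable, Optional

TOLERANCE_SECONDS = 300  # the 5-minute replay window


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<raw body>", hex encoded."""
    signed_content = str(timestamp).encode() + b"." + body
    return hmac.new(secret, signed_content, hashlib.sha256).hexdigest()


def verify(
    active_secrets: Iterable[bytes],
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Consumer-side check: fresh timestamp and a match against any active secret."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: possible replay
    received = signature.encode()
    # Accepting several secrets is what makes rotation possible.
    return any(
        hmac.compare_digest(sign(secret, timestamp, body).encode(), received)
        for secret in active_secrets
    )
```

Two details matter in practice: sign and verify the raw bytes of the request body (not a re-serialized JSON object), and compare signatures with a constant-time function such as hmac.compare_digest.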
Idempotency keys are your consumers' best friend. Every webhook event should include a unique, stable identifier that does not change across retries. If the same event is delivered three times due to retries, all three deliveries carry the same event_id. The consumer stores processed event IDs in a set (Redis with a TTL works well) and skips any duplicates. Document this pattern prominently. Provide sample code. The easier you make idempotent processing, the fewer support tickets you will handle about duplicate events.
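Here is the kind of sample code you might hand to consumers. It is a sketch: SeenEvents is a hypothetical in-memory stand-in for a Redis set with a TTL, and the seven-day default TTL is an assumption. The event ID is recorded only after processing succeeds, so a handler that raises will still accept the retry.

```python
import time
from typing import Callable, Dict


class SeenEvents:
    """In-memory stand-in for a Redis key set with per-key expiry."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def contains(self, event_id: str) -> bool:
        expiry = self._expires_at.get(event_id)
        if expiry is None:
            return False
        if expiry <= self._clock():
            del self._expires_at[event_id]  # expired: forget it
            return False
        return True

    def add(self, event_id: str) -> None:
        self._expires_at[event_id] = self._clock() + self._ttl


def handle(event: dict, seen: SeenEvents, process: Callable[[dict], None]) -> bool:
    """Process an event at most once per TTL window. Returns True if processed."""
    event_id = event["event_id"]
    if seen.contains(event_id):
        return False  # duplicate delivery: acknowledge and skip
    process(event)  # may raise; the ID is only recorded on success
    seen.add(event_id)
    return True
```

This check-then-record order leaves a small window where two concurrent deliveries of the same event could both be processed. With a shared store, an atomic "set if not exists, with expiry" operation narrows that window, at the cost of needing to clear the key if processing fails.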
One additional security consideration: validate that customer endpoint URLs use HTTPS. Reject plain HTTP endpoints in production. Webhook payloads often contain sensitive data, and delivering them over unencrypted connections is a liability. Some products allow HTTP for development and localhost URLs but enforce HTTPS for any public-facing endpoint.
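A registration-time check for this rule can be very small. The sketch below assumes a hypothetical endpoint_url_allowed function and a fixed set of loopback hostnames. It only enforces the scheme rule described above and does not attempt any other validation of the target.

```python
from urllib.parse import urlparse

# Hosts treated as "local development" in this sketch (an assumption).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def endpoint_url_allowed(url: str, allow_local_http: bool = False) -> bool:
    """Accept HTTPS everywhere; accept plain HTTP only for local hosts when enabled."""
    parsed = urlparse(url)
    host = parsed.hostname  # lowercased, brackets stripped for IPv6 literals
    if not host:
        return False
    if parsed.scheme == "https":
        return True
    if parsed.scheme == "http" and allow_local_http and host in LOCAL_HOSTS:
        return True
    return False
```

In production you would call it with allow_local_http left at False, and enable the flag only in development environments.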
Database-Backed vs Queue-Backed Architectures
The architectural divide in webhook delivery systems comes down to where you store and process your event queue. Both approaches work, but they suit different scales and team capabilities. Choosing wrong means a painful migration later.
Database-backed (PostgreSQL LISTEN/NOTIFY or polling)
This approach uses your existing PostgreSQL database as both the event store and the dispatch mechanism. When an event occurs, you insert a row into a webhook_events table. A worker process picks up pending events, attempts delivery, and updates the row with the result. PostgreSQL LISTEN/NOTIFY can trigger workers in near-real-time without polling, keeping latency low.
Advantages: no additional infrastructure to manage, full ACID guarantees on event creation (the event and the business action that triggered it can share a database transaction), and simple debugging since everything is queryable with SQL. For a SaaS product delivering fewer than 10,000 webhooks per hour, this is often the right call. We have used this pattern successfully for several clients, and it is refreshingly simple to operate.
Disadvantages: PostgreSQL was not designed as a job queue. At high throughput, contention on the events table becomes a bottleneck, especially if workers claim rows without SELECT ... FOR UPDATE SKIP LOCKED. LISTEN/NOTIFY notifications are not persisted: a worker that is disconnected or restarting when a notification fires never sees it, so you still need a polling fallback to pick up missed rows. You also need to build your own retry scheduling, which means cron jobs or polling loops that add complexity.
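The transactional benefit is easiest to see in code. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it runs anywhere. The table layout, column names, and helper functions are hypothetical. In PostgreSQL, the claim step would typically use SELECT ... FOR UPDATE SKIP LOCKED so that several workers can claim rows without blocking each other.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE webhook_events (
        event_id   TEXT PRIMARY KEY,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        attempts   INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO invoices (id, status) VALUES ('inv_123', 'open');
""")


def mark_invoice_paid(conn: sqlite3.Connection, invoice_id: str, event_id: str) -> None:
    """Business change and webhook event commit (or roll back) together."""
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute("UPDATE invoices SET status = 'paid' WHERE id = ?", (invoice_id,))
        conn.execute(
            "INSERT INTO webhook_events (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "invoice.paid", json.dumps({"invoice_id": invoice_id})),
        )


def claim_next_pending(conn: sqlite3.Connection):
    """Worker step: take the oldest pending event and mark it as in flight."""
    with conn:
        row = conn.execute(
            "SELECT event_id, payload FROM webhook_events "
            "WHERE status = 'pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE webhook_events SET status = 'delivering', attempts = attempts + 1 "
            "WHERE event_id = ?",
            (row[0],),
        )
        return row


mark_invoice_paid(conn, "inv_123", "evt_1")
claimed = claim_next_pending(conn)
```

Because the invoice update and the event insert share one transaction, you never end up with a paid invoice and no event, or an event for a change that was rolled back.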
Queue-backed (Redis Streams, SQS/SNS, BullMQ, Inngest)
This approach uses a dedicated message queue or workflow engine to handle dispatch and retries. When an event occurs, you publish it to a queue. Workers consume from the queue, attempt delivery, and use the queue's built-in retry mechanics for failures. Event-driven architecture principles apply here: producers and consumers are decoupled, and the queue handles backpressure naturally.
For Node.js/TypeScript stacks, BullMQ on top of Redis is a strong choice. It provides delayed jobs (perfect for exponential backoff), job prioritization, rate limiting per queue, and a dashboard (Bull Board) for monitoring. Inngest is worth considering if you want a managed workflow engine. It handles retries, scheduling, and observability out of the box, and its step-function model maps cleanly to webhook delivery workflows.
For AWS-native stacks, SQS with a dead-letter queue configuration gives you retry behavior and DLQ management with minimal code. SNS fans out events to multiple SQS queues if you need to process the same event in different ways (delivery, analytics, audit logging).
Redis Streams offer a middle ground. They are more capable than pub/sub (they persist messages and support consumer groups) but lighter than a full SQS setup. If you are already running Redis for caching, Streams add webhook processing without new infrastructure.
Our recommendation: start with PostgreSQL-backed delivery if you are early stage and processing low to moderate volume. Migrate to BullMQ or Inngest when you hit consistent delivery volumes above 5,000 events per hour or when your retry logic becomes complex enough to warrant a dedicated system. Do not start with SQS/SNS unless your entire stack is already on AWS and your team is comfortable with the operational overhead.
Monitoring, Rate Limiting, and Circuit Breakers
A webhook system without monitoring is a webhook system that is failing silently. You need visibility into delivery health at three levels: per-endpoint, per-customer, and system-wide. Without this, you will discover problems only when customers complain, and by then the damage is done.
Delivery monitoring and alerting
Track these metrics for every webhook endpoint:
- Delivery success rate (target: above 99.5% for healthy endpoints)
- Average and p95 response times from consumer endpoints
- Number of events in retry queue per endpoint
- DLQ depth (events that exhausted all retries)
- Time from event creation to successful delivery (end-to-end latency)
Set up alerts when an endpoint's success rate drops below 95%, when DLQ depth exceeds a threshold, or when end-to-end delivery latency spikes. Use Datadog, Grafana, or even a simple Slack integration for alerting. The key is that someone (either your team or the customer) gets notified before a small problem becomes a large one.
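As a small illustration of the per-endpoint metrics, the sketch below computes a success rate and a p95 latency from a list of recent attempts and applies the 95% alert threshold. The Attempt record and the function names are hypothetical. In a real system these numbers would usually come from your metrics backend rather than from in-process lists.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    status: Optional[int]  # None for a timeout or connection error
    latency_ms: float


def success_rate(attempts: List[Attempt]) -> float:
    """Fraction of attempts that returned a 2xx status."""
    if not attempts:
        return 1.0  # no traffic is treated as healthy in this sketch
    ok = sum(1 for a in attempts if a.status is not None and 200 <= a.status < 300)
    return ok / len(attempts)


def p95_latency_ms(attempts: List[Attempt]) -> float:
    """Nearest-rank 95th percentile of response time."""
    ordered = sorted(a.latency_ms for a in attempts)
    if not ordered:
        return 0.0
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_alert(attempts: List[Attempt], threshold: float = 0.95) -> bool:
    """True when the endpoint's success rate has dropped below the threshold."""
    return success_rate(attempts) < threshold
```

Evaluate these over a sliding window (for example the last hour of attempts per endpoint) so that one old outage does not keep an alert firing.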
Build a customer-facing webhook dashboard. This is not optional for B2B SaaS. Your customers need to see their delivery history, inspect payloads, view error responses, and replay failed events. Stripe's webhook dashboard is the gold standard. At minimum, provide: a list of recent deliveries with status codes, the ability to inspect request and response bodies, a button to retry individual failed events, and endpoint health status over time. If you use Svix, Hookdeck, or Convoy, you get a pre-built management UI that you can embed in your product.
Rate limiting
You need rate limiting on two dimensions. First, limit outbound delivery rate per endpoint to prevent overwhelming your customers' servers. A reasonable default is 100 requests per second per endpoint, configurable by the customer. Second, limit inbound event creation to prevent a bug in your system from generating millions of spurious events. Use a token bucket or sliding window algorithm, and make the limits transparent to your customers.
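A token bucket is only a few lines. The sketch below is a single-process version with an injectable clock. The class name and parameters are ours, and a multi-worker deployment would need the bucket state in a shared store, which this does not show.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False (do not block) otherwise."""
        now = self._clock()
        # Refill based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


# One bucket per endpoint, using the 100 requests per second default.
endpoint_bucket = TokenBucket(rate_per_second=100, capacity=100)
```

When try_acquire returns False, the worker should put the delivery back on the queue with a short delay rather than dropping it.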
Circuit breakers
When an endpoint fails consistently, a circuit breaker pattern prevents your system from wasting resources on a dead endpoint. After N consecutive failures (we use 10 as a default), the circuit opens. While open, new events for that endpoint are queued but not delivered. The circuit moves to half-open after a cooldown period (15 minutes), allowing a single test delivery through. If it succeeds, the circuit closes and normal delivery resumes. If it fails, the circuit stays open for another cooldown period.
This protects both your infrastructure (no wasted compute on guaranteed failures) and your customer's infrastructure (no flood of retries when they bring their endpoint back online). Notify the customer via email when their circuit opens so they can investigate and fix their endpoint.
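The closed, open, and half-open states map onto a small state machine. This is a single-process sketch with hypothetical names. It uses the defaults mentioned above (10 consecutive failures, 15-minute cooldown) and leaves out persistence and customer notification.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10, cooldown_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def allow_delivery(self) -> bool:
        """Should the worker attempt a delivery to this endpoint right now?"""
        if self.state == "closed":
            return True
        if self.state == "open" and self._clock() - self._opened_at >= self._cooldown:
            self.state = "half_open"
            return True  # let exactly one probe delivery through
        return False  # still cooling down, or a probe is already in flight

    def record_success(self) -> None:
        self.state = "closed"
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self._threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A worker calls allow_delivery() before each attempt, leaves the event queued when it returns False, and reports the outcome with record_success() or record_failure().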
Build vs Buy: Svix, Hookdeck, Convoy, and Custom Solutions
The build-versus-buy decision for webhook infrastructure is straightforward once you understand the true cost of building. Most teams underestimate the engineering effort by 3x to 5x because they only think about the initial delivery mechanism and forget about retries, signing, monitoring, customer UI, versioning, testing tools, and ongoing maintenance.
When to buy
If webhooks are a feature of your product (not the product itself), buy. The managed solutions have spent years solving the edge cases you have not thought of yet. Svix, Hookdeck, and Convoy each bring something different to the table.
Svix is purpose-built for sending webhooks from your application to your customers. It handles payload signing, retries, a customer-facing management portal, and delivery analytics. You integrate it with a few API calls, and your customers get a polished webhook management experience. Pricing is based on message volume, starting free for low usage. Svix is our go-to recommendation for most SaaS products because it does one thing extremely well.
Hookdeck positions itself as webhook infrastructure for both sending and receiving. If your product both sends webhooks and consumes them from third-party services (common in integration platforms), Hookdeck covers both directions. It offers request transformation, filtering, and fan-out capabilities that Svix does not focus on. The debugging tools are excellent, with full request inspection and replay.
Convoy is the open-source option. If you need to self-host (for compliance requirements, data residency, or simply on principle), Convoy gives you a full webhook gateway that you deploy on your own infrastructure. It supports both incoming and outgoing webhooks, with retry policies, rate limiting, and a management dashboard. The trade-off is that you own the operational burden of running it.
When to build
Build your own when webhooks are your core product (you are building an integration platform or event bus), when you have extreme customization requirements that the managed solutions cannot accommodate, or when you are at a scale where the per-message pricing of managed solutions exceeds the cost of dedicated engineering time. For most SaaS startups, that break-even point is somewhere above 50 million events per month.
If you do build, use the architecture patterns from the previous sections and lean on existing queue infrastructure (BullMQ, Inngest, SQS) rather than building a queue from scratch. Your custom webhook system should still follow the same conventions that Svix and Stripe established: HMAC-SHA256 signing, exponential backoff with jitter, idempotency keys, and a customer-facing dashboard. Do not invent new patterns that force your customers to learn a non-standard integration approach.
Versioning, Testing, and Debugging Tools
Webhook versioning is the thing everyone forgets until they need to change a payload structure and realize they will break every integration their customers built. Plan for it from day one.
Webhook versioning
The simplest approach is to include an api_version field in every payload and let customers pin their subscription to a specific version. When you release a new version, existing subscriptions continue receiving the old format until the customer explicitly upgrades. This is how Stripe handles it, and it works well. Maintain at least two versions simultaneously, with a deprecation timeline of 12 months minimum for older versions.
What constitutes a breaking change? Adding a new field to the payload is not breaking (consumers should ignore unknown fields). Removing a field, renaming a field, changing a field's type, or restructuring the payload is breaking. Changing event type names is breaking. When in doubt, treat it as breaking and release a new version.
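These rules are mechanical enough to check in CI. The sketch below compares two payload schemas, represented here as plain dictionaries mapping field names to type names (a simplification we chose for illustration, not a real schema format). A rename shows up as a removed field, which is breaking, plus an added field, which is not.

```python
from typing import Dict, List


def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List the reasons `new` would break consumers built against `old`.

    Added fields are ignored: consumers are expected to skip unknown fields.
    """
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} ({old_type} -> {new[field]})")
    return problems


v1 = {"invoice_id": "string", "amount": "integer"}
v2_additive = {"invoice_id": "string", "amount": "integer", "currency": "string"}
v2_retyped = {"invoice_id": "string", "amount": "string"}

additive_report = breaking_changes(v1, v2_additive)  # empty: safe to ship in place
retyped_report = breaking_changes(v1, v2_retyped)    # non-empty: needs a new version
```

If the report is non-empty, the change belongs in a new api_version rather than in the version existing subscriptions are pinned to.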
Testing tools for your customers
Your customers need ways to test their webhook integrations without triggering real events. Provide these tools:
- Test event delivery: A button in the dashboard that sends a sample event to the registered endpoint. Include realistic but clearly fake data.
- Event catalog: Documentation of every event type with example payloads. Make these copy-pasteable so developers can use them in unit tests.
- Webhook CLI tool: A local development tool (similar to stripe listen) that creates a tunnel and forwards events to localhost. This is invaluable for developers building and debugging their handlers.
- Signature verification playground: An interactive tool where developers paste a payload and secret and see the computed signature. This eliminates the most common integration headache: signature verification failures caused by encoding issues.
Debugging tools for your team
On your side, you need the ability to trace an event from creation through every delivery attempt. Build (or buy) a tool that lets you search events by ID, customer, event type, or time range. For each event, show the full delivery timeline: when it was created, each attempt with the request and response, and the final outcome. This is the first thing you will reach for when a customer reports a missing webhook.
Log the full request and response for every delivery attempt, but be mindful of storage costs at scale. A reasonable retention policy is 30 days of full payloads and 90 days of metadata (timestamps, status codes, latency). Compress older payloads and archive them to cold storage if compliance requires longer retention.
If you have been building a new SaaS product and want to get webhook delivery right from the start, or you are dealing with reliability issues in your existing webhook system, book a free strategy call with our team. We have built and scaled webhook infrastructure for multiple SaaS products and can help you choose the right architecture for your specific requirements and growth trajectory.