Ask HN: How do you monitor and retry failed webhooks in production?

4 pointsposted 10 hours ago
by GoatPerfect

Item id: 47105306

6 Comments

blundergoat

10 hours ago

We treat webhooks as at-least-once delivery over an unreliable transport and design for duplicates and out-of-order events.

A few rules that have saved us:

- Persist before responding. Never process inline. Write payload to DB, return 200 fast.

- Idempotency key required. Either provider event ID or hash the payload.

- Async worker processes from queue. Exponential backoff + max attempts.

- Dead letter queue + dashboard. Humans need visibility.

- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.

- Relying on provider retries alone has bitten us more than once.

GoatPerfect

9 hours ago

Thank you so much for tips! I was feeling nervous about relying on provider retires as well. I especially like the idea of alerting on backlog growth. There's nothing I hate more than a bunch of emails and notifications!

JacobArthurs

9 hours ago

We receive the webhook, return 200 immediately, and push the payload to a message queue for processing. That way you own the retry logic, can inspect stuck messages, and DLQ alerts handle repeated failures automatically.

Idempotency becomes your responsibility, though, since messages can be delivered more than once.