## The shape
## When to adopt this pattern
- Any per-event external call — API, email, PDF generation
- Volume above ~1k events/minute
- You need fan-out — one event dispatches to 3 systems
- Handler p95 approaching 1 second
- You want retry behavior independent of Sly’s own retries
## Queue options
| Choice | Best for | Tradeoffs |
|---|---|---|
| Postgres table (pgmq, or hand-rolled) | Small-medium (up to ~5k/min) | Same DB as app; simple; scaled limit ~10k/min |
| Redis Streams / Sidekiq / BullMQ | Medium (5k-50k/min) | Extra infra; persistent; millions/min ceiling |
| SQS / Google Pub/Sub / RabbitMQ | Large | Managed; multi-region; requires operational knowledge |
| Kafka | Massive + multi-consumer | Overkill for most payment backends |
## Reference: Node + BullMQ
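A minimal sketch of the enqueue-then-ack split, assuming BullMQ and a Redis at `localhost:6379`. The queue name, event shape, and `processors` map are illustrative, not part of Sly's API.

```javascript
// Hypothetical wiring; requires the `bullmq` package and a running Redis.
const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };
const events = new Queue('sly-events', { connection });

// In the webhook handler: verify signature, enqueue, ack immediately.
async function handleWebhook(event) {
  await events.add(event.type, event, {
    jobId: event.id,  // dedupe: a duplicate delivery is silently dropped
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  });
  return { status: 200 }; // ack = "we got it", not "we succeeded"
}

// Illustrative dispatch table; real processors go here.
const processors = {
  'payment.settled': async (data) => { /* call downstream API, etc. */ },
};

// Runs in a separate worker process: actual processing happens here.
new Worker('sly-events', async (job) => {
  await processors[job.name](job.data);
}, { connection, concurrency: 10 });
```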
## Dedupe via jobId
Using `event.id` as the job ID gives you idempotency for free — BullMQ refuses to add a duplicate. The same applies to SQS (via `MessageDeduplicationId`), Redis Streams (`XADD` with an explicit ID), and most other queue backends.
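The refusal-to-duplicate behavior can be shown with a plain in-memory stand-in for the queue's job-ID index; this is a simulation of the idea, not BullMQ itself.

```javascript
// In-memory stand-in for a queue that keys jobs by ID, the way
// BullMQ does when you pass { jobId: event.id }.
class DedupingQueue {
  constructor() { this.jobs = new Map(); }
  add(jobId, payload) {
    if (this.jobs.has(jobId)) return false; // duplicate delivery: refused
    this.jobs.set(jobId, payload);
    return true;
  }
}

const q = new DedupingQueue();
const event = { id: 'evt_123', type: 'payment.settled' };

const first = q.add(event.id, event);   // true: enqueued
const second = q.add(event.id, event);  // false: duplicate dropped
```

Redelivered webhooks (Sly retrying an already-acked event, a network blip, etc.) thus collapse into a single job.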
## Retry semantics
Two levels of retry:

- Sly → your handler — 5 attempts, at 1m / 5m / 15m / 1h / 24h
- Your worker — configurable; typically 5 attempts with exponential backoff
Set `attempts: 1` on the queue handler if you want Sly to stop retrying after the first ack (acks mean “we got it”, not “we succeeded”); your worker then handles the actual processing and its own retries.
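For the worker-side retries, exponential backoff is typically base × 2^(attempt − 1); this helper makes the schedule explicit (the function name is mine, but the formula matches what BullMQ's built-in `exponential` backoff produces).

```javascript
// Delay before retry N (1-indexed), exponential with a base delay.
// Mirrors backoff: { type: 'exponential', delay: baseMs } in BullMQ.
function backoffDelayMs(attempt, baseMs) {
  return baseMs * 2 ** (attempt - 1);
}

// Five attempts with a 1s base: 1s, 2s, 4s, 8s, 16s between tries.
const schedule = [1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, 1000));
```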
## Per-event-type routing
Different event types often want different concurrency and priority settings:
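One way to express this is a plain config map consulted at enqueue time. The event types, queue names, and numbers here are illustrative (in BullMQ, a lower `priority` number is more urgent).

```javascript
// Hypothetical routing table: each event type gets its own queue,
// worker concurrency, and priority.
const routes = {
  'payment.settled': { queue: 'critical', concurrency: 20, priority: 1 },
  'payout.created':  { queue: 'standard', concurrency: 10, priority: 5 },
  'report.ready':    { queue: 'bulk',     concurrency: 2,  priority: 10 },
};

// Fall back to a default route for unrecognized event types.
function routeFor(eventType) {
  return routes[eventType] ?? { queue: 'standard', concurrency: 5, priority: 5 };
}
```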
## DLQ handling

Events that fail all worker retries should land in a dead-letter queue you monitor. Whatever your queue backend, configure a DLQ and alert on it.
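The move-to-DLQ step can be sketched as: after the final failed attempt, park the job somewhere durable and alert. This in-memory version is a stand-in for whatever your backend provides natively (SQS redrive policies, BullMQ failed-job sets, etc.).

```javascript
const MAX_ATTEMPTS = 5;
const deadLetters = [];

// Called by the worker when a job's processor throws.
function onJobFailed(job, err) {
  job.attemptsMade += 1;
  if (job.attemptsMade >= MAX_ATTEMPTS) {
    deadLetters.push({ job, reason: String(err) }); // park it; alert on arrival
    return 'dead-lettered';
  }
  return 'will-retry';
}

const job = { id: 'evt_9', attemptsMade: 4 };
const outcome = onJobFailed(job, new Error('downstream 503'));
```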
## Back-pressure

If workers can’t keep up, the queue grows. At some point this becomes a problem:

- Latency of event processing climbs
- Memory / storage on the queue fills
- Sly-side retry window is exhausted before you’ve drained the backlog
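A quick sanity check for the last failure mode: given the current depth and drain rate, will the backlog clear before Sly's final retry at 24h? Names and numbers here are illustrative.

```javascript
// Estimate whether the backlog drains before the retry window closes.
const RETRY_WINDOW_MS = 24 * 60 * 60 * 1000; // Sly's final retry is at 24h

function backlogHealthy(queueDepth, drainRatePerSec) {
  const drainTimeMs = (queueDepth / drainRatePerSec) * 1000;
  return drainTimeMs < RETRY_WINDOW_MS;
}

const ok  = backlogHealthy(50_000, 10);    // ~83 minutes to drain: fine
const bad = backlogHealthy(5_000_000, 10); // ~5.8 days: window exhausts first
```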
## Parallelism
Single-worker-per-event-type is simplest but throughput-limited. Two ways to scale:

- Horizontal — more worker processes; the queue library distributes jobs among them
- Concurrency within a worker — process N jobs in parallel per process
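Both axes translate into configuration rather than code: raise per-process concurrency (the `concurrency` worker option in BullMQ) and launch more processes against the same queue name. The numbers below are illustrative.

```javascript
// Per-process parallelism: a worker with concurrency N processes
// up to N jobs at once in a single Node process.
const workerOptions = {
  concurrency: 10, // tune against handler latency and downstream rate limits
};

// Horizontal scaling is then just more processes with the same queue name;
// the queue library distributes jobs among them.
const processes = 4;
const maxInFlightJobs = processes * workerOptions.concurrency; // upper bound on parallel jobs
```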
## Shared infrastructure notes
If this worker infra also handles non-webhook work (cron jobs, user-initiated async tasks, etc.), consider:

- Separate queues per workload — prevents starvation
- Separate worker pools — can scale independently
- Shared dedupe — if an event triggers work that can also be started manually, dedupe across both paths
## Observability
Beyond webhook-level metrics (see event-driven):

- Queue depth over time — trending up = back-pressure
- Job duration by type — surfaces heavy processors
- Retry rate by type — locates flaky handlers
- DLQ arrival rate — bug / outage indicator
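Job duration by type is the one teams most often skip; a tiny aggregator like this (hypothetical, not a library API) is enough to surface heavy processors.

```javascript
// Collect durations per event type and report a percentile.
const durations = new Map(); // eventType -> array of ms samples

function recordDuration(eventType, ms) {
  if (!durations.has(eventType)) durations.set(eventType, []);
  durations.get(eventType).push(ms);
}

function p95(eventType) {
  const xs = [...durations.get(eventType)].sort((a, b) => a - b);
  return xs[Math.min(xs.length - 1, Math.floor(xs.length * 0.95))];
}

// One slow outlier dominates the p95 and flags this processor as heavy.
for (const ms of [40, 55, 60, 48, 900]) recordDuration('pdf.generate', ms);
const heavy = p95('pdf.generate');
```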
## When this isn’t enough
If you’re at several million events/min or need cross-region durability, consider:

- Partitioned queues (Kafka-style)
- State-machine modeled approach with durable state transitions
- Multi-region worker clusters
## See also
- Event-driven — the simpler starting point
- State-machine modeled — for complex multi-step flows
- Webhook recipes — verification + dedupe patterns
