Design distributed rate limiters

- Recent engineering writeups detailed how to build distributed rate limiters with Redis token buckets, atomic updates, sharding, and hot-key controls for APIs spread across many servers.
- A companion system-design post applied the same traffic-control ideas to notifications, sketching a platform that accepts and dispatches 100 million messages per second.
- Together they frame rate limiting as core infrastructure for queues, retries, and provider isolation in large distributed systems. (martinuke0.github.io) (crackingwalnuts.com)

A distributed rate limiter is a shared speed governor for an API: every server checks the same budget before it lets a request through. (martinuke0.github.io) (redis.io) The common design is a token bucket. Tokens refill over time, requests spend them, and short bursts are allowed until the bucket runs dry. (martinuke0.github.io) (redis.io)

That model works on one machine, but breaks once traffic is spread across many application instances and regions. A per-server counter lets each node overshoot the real global limit. (martinuke0.github.io) Redis shows up here because it is fast, shared, and supports atomic updates through Lua scripts. That lets one operation refill tokens, subtract cost, and return allow-or-deny without race conditions. (martinuke0.github.io) (redis.io)

At higher scale, the hard part is not the algorithm but the distribution of keys. If one customer, IP block, or API token becomes a hot key, a single Redis shard can become the bottleneck. (martinuke0.github.io) The March 9 writeup walks through the production answers: shard the buckets across a Redis cluster, separate global and per-user limits, and plan for clock skew and multi-region replication. (martinuke0.github.io)

The same control problem shows up in notification systems. A platform that accepts push traffic for web, Android, and iOS still needs shared budgets, because one slow provider can clog the whole pipeline if sends happen inline. (crackingwalnuts.com) The March 6 notification design argues for asynchronous queues, horizontal partitioning, and at-least-once delivery instead of exactly-once promises. It says deduplication and idempotency are the practical way to prevent duplicate sends after retries. (crackingwalnuts.com) (littlehorse.io) Retries are useful only if they are measured. Queue systems need backoff for transient failures and dead-letter queues for poison messages that will never succeed. (littlehorse.io)

That is why rate limiting is usually enforced near the gateway and then repeated deeper in the stack with separate budgets for tenants, users, and downstream providers. One limiter protects the front door; another keeps internal queues and vendors from being flooded. (martinuke0.github.io) (crackingwalnuts.com)

The through line in both pieces is simple: once traffic reaches millions of requests or messages per second, fairness, retries, and observability become the product. The systems that stay debuggable are the ones that treat limits as shared infrastructure, not a helper function in one service. (martinuke0.github.io) (crackingwalnuts.com)
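
For readers who want to see the token-bucket check in concrete form, here is a minimal single-process sketch in Python. It is not code from either writeup: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for illustration. A thread lock stands in for the atomicity that the Redis design gets from running the same refill, spend, and allow-or-deny steps inside one Lua script.

```python
import threading
import time


class TokenBucket:
    """Illustrative in-memory bucket (hypothetical, not from the cited posts)."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # maximum burst size, in tokens
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.clock = clock                     # injectable so tests can control time
        self.tokens = float(capacity)          # start full so an initial burst passes
        self.updated_at = clock()
        self._lock = threading.Lock()          # stand-in for Redis script atomicity

    def allow(self, cost=1.0):
        """Refill, try to spend `cost` tokens, and return True if allowed."""
        with self._lock:
            now = self.clock()
            # Guard against a clock that moves backwards (e.g. skew between hosts).
            elapsed = max(0.0, now - self.updated_at)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
            self.updated_at = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=5, refill_rate=2)  # burst of 5, 2 requests/sec
    decisions = [bucket.allow() for _ in range(7)]
    print(decisions)
```

In a distributed deployment the bucket state (`tokens`, `updated_at`) would live in the shared store under a per-tenant or per-user key rather than in process memory, so that every server spends from the same budget.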
