Table of contents:
The most common mechanism available in both open source and commercial API gateways is rate limiting: making sure a particular client sends no more than a certain number of requests per unit time. As it turns out, this is exactly the wrong abstraction for multiple reasons.
This small project aims to reproduce results from Jon Moore's talk
Stop Rate Limiting! Capacity Management Done Right.
He illustrated where rate limiting can break down using
Little's law N = X * R:
- N is capacity (number of workers)
- X is throughput (requests arrival rate)
- R is service time (how long it takes a worker to process a request)
In the examples client sends 5 requests per second using 10 workers which wait for a response no longer than 2.5 seconds.
$ ./client -worker=10 -rps=5 -origin=http://origin:8000Origin server has an SLO to serve 99% of requests within 1 second. It has fixed number of workers, each takes 1 second on average to process a request. Requests are enqueued when workers are busy and discarded when the queue is full.
$ ./origin -worker=7 -worktime=1s -queue=100According to Little's law, origin should be able to handle 7 requests per second.
N = X * R
7 workers = X rps * 1s
X = 7/1 = 7 rps
Run the programs in Docker Compose and check out Grafana dashboard:
- client sends 5 requests per second and receives HTTP 200 OK responses
- origin processes 5 requests per second
- origin has served requests with average latency 1 second
- origin has served 50% of requests (50th percentile) within 1 second
- origin has served 99% of requests (99th percentile) within 1 second
Note, quantiles are estimated.
Almost all observations fall into the bucket {le="1.05"}, i.e. the bucket from 1s to 1.05s.
The histogram implementation guarantees that the true 99th percentile is somewhere between 1s and 1.05s.
The calculated quantile might give an impression that API is close to breaching the SLO
if bucket boundaries were not chosen appropriately (sharp spikes).
Clone the repository.
$ git clone https://github.com/marselester/capacity.git
$ cd ./capacity/dockerRun origin server, client (load generator), Grafana, and Prometheus using Docker Compose.
$ docker-machine start
$ eval "$(docker-machine env)"
$ docker-compose up
$ docker-machine ip
192.168.99.100Open Grafana http://192.168.99.100:3000 with default credentials admin/admin. Prometheus dashboard is available at http://192.168.99.100:9090.
Clean up once you've done experimenting.
$ docker-compose down
$ docker rmi marselester/capacity
$ docker image prune --filter label=stage=intermediate
$ docker-machine stopNew release of origin server has a bug that made workers process a request within 2 seconds.
$ ORIGIN_WORKTIME=2s docker-compose upAccording to Little's law, origin should be able to handle 3.5 requests per second.
N = X * R
7 workers = X rps * 2s
X = 7/2 = 3.5 rps
Since worker pool is able to process 3.5 requests per second, it can drain the queue at the same rate. Requests arrive at 5 rps, which means queue will be growing infinitely. Therefore processing time will also be growing infinitely.
N = X * R
∞ = 3.5 rps * R
R = ∞ / 3.5 = ∞ seconds
Observations:
- client sends 5 requests per second and they all time out
- origin processes 3.5 requests per second
- origin's average latency grows with queue
- origin's 50th and 99th percentiles show maximum configured latency of 4 seconds (the biggest bucket)
Developers increased number of origin workers to 20 while they investigate why a worker takes 2 seconds to process a request instead of 1 second.
$ ORIGIN_WORKER=20 ORIGIN_WORKTIME=2s docker-compose upAccording to Little's law, origin should be able to handle 10 requests per second.
N = X * R
20 workers = X rps * 2s
X = 20/2 = 10 rps
Observations:
- client sends 5 requests per second and receives HTTP 200 OK responses
- origin processes 5 requests per second
- origin has served requests with average latency 2 seconds
- origin has served 50% of requests (50th percentile) within 2 seconds
- origin has served 99% of requests (99th percentile) within 2 seconds
Client should be able to send X=5 requests per second with average response time R=1 second. This means a client should be limited to N=5 concurrent requests on a proxy.
N = X * R
N = 5 rps * 1s = 5 requests in flight
When processing time increases to R=2 seconds, a client is still limited to 5 requests in flight by proxy. Therefore a client will end up limited to X=2.5 rps.
N = X * R
5 requests in flight = X rps * 2s
X = 5/2 = 2.5 rps
In order to allow a service to recover, a client is forced to back-off: send 2.5 rps instead of 5 rps. Proxy limits concurrency (how many requests are in flight), not request rate (rps).
$ ORIGIN_WORKTIME=2s CLIENT_ORIGIN=http://proxy:7000 docker-compose upObservations:
- client sends 5 requests per second and receives 2.3 rps (HTTP 200)
- proxy oscillates between 4 and 5 in-flight requests
- origin processes 2.3 requests per second
- origin has served requests with average latency 2 seconds
- origin has served 50% of requests (50th percentile) within 2 seconds
- origin has served 99% of requests (99th percentile) within 2 seconds
If origin's capacity is unknown, it's possible to estimate capacity using AIMD algorithm (additive-increase/multiplicative-decrease):
- increase target concurrency by a constant cper unit time, e.g., allow 1 more rps every second
- set target concurrency to a fraction pof its current size (0 <= p <= 1), e.g., back-off to 75% when a service is overloaded (429 or 50x status codes, connection timeout)
Client waits for 2.5 seconds before timing out (cancels request). In order to receive HTTP 200 OK responses a request should be processed in less than 2.5 seconds.
Request is processed within 2 seconds by origin's workers (N=7).
N = X * R
7 workers = X rps * 2s
X = 7/2 = 3.5 rps
Request has only 2.5s - 2s = 0.5 second to stay in a queue, otherwise a client won't receive it due to timeout. Expected origin's queue length is 1.75 requests.
N = X * R
N = 3.5 rps * 0.5s = 1.75 requests in queue
Expected number of concurrent requests to origin is 8.75.
N = X * R
N = 3.5 rps * 2.5s = 8.75 requests in flight
Performance of my naive proxy is inferior and results don't correspond to Jon Moore's chart (Nginx/Lua).
$ ORIGIN_WORKTIME=2s CLIENT_ORIGIN=http://proxy:7000 PROXY_ADAPTIVE=true docker-compose upObservations:
- client sends 5 requests per second and receives between 2.4 rps and 2.7 rps (HTTP 200)
- proxy oscillates between 4 and 6 in-flight requests
- origin processes between 2.7 and 3 requests per second
- origin has served requests with average latency 2.2 seconds
- origin has served 50% of requests (50th percentile) within 2 seconds
- origin has served 99% of requests (99th percentile) within 3 seconds with periodic spikes





