Part 2: Rate Limiting for API Gateways

Daniel Bryant · Published in Ambassador Labs · May 8, 2018 · 7 min read


In the first article of this Rate Limiting series, I introduced the motivations for rate limiting. I discussed several implementation options (depending on whether or not you own both sides of the communication) and the associated tradeoffs. This article dives a little deeper into the need for rate limiting with API gateways.

Why Rate Limiting with an API Gateway?

In the first article, I discussed options for where to implement rate limiting: the source, the sink, or middleware (literally a service in the middle of the source and sink).

When exposing your application via a public API, you typically have to implement rate limiting within the sink or middleware that you own. Even if you control the source (client) application, you will typically want to guard against bugs that cause excess API requests and also against bad actors who may attempt to subvert your client applications.

The Stripe blog has an excellent article on “Scaling your API with rate limiters”, which I’ll be referencing throughout this post, and the opening section talks about how rate limiting can help make your API more reliable in the following scenarios:

  • One of your users is responsible for a spike in traffic that is overwhelming your application, and you need to stay up for everyone else.
  • One of your users has a misbehaving script that accidentally sends you many requests (trust me, this happens more often than you think — I’ve personally created load test scripts that accidentally triggered a self-inflicted denial of service!). Or, even worse, one of your users is intentionally trying to overwhelm your servers.
  • A user is sending you many lower-priority requests, and you want to make sure that it doesn’t affect your high-priority traffic. For example, users sending a high volume of analytics data requests could affect other users’ critical transactions.
  • Something in your system has gone wrong internally, and as a result, you can’t serve all of your regular traffic and need to drop low-priority requests.

At Ambassador Labs we have seen these patterns firsthand, particularly with organizations exposing "freemium"-style public APIs, where there is a clear business requirement to prioritize traffic for paying customers and protect against bad actors (intentional or otherwise).

The Basics of Rate Limiting and Load Shedding

Fundamentally, rate limiting is simple. For each request property you want to limit against, you count the number of times each unique value of that property is seen and reject the associated request if the count exceeds the specified limit per time unit. So, for example, if you wanted to limit the number of requests each client made, you would use the "client identifier" property (perhaps set via the query string key "clientId", or included in a request header) and keep a count for each identifier.

You would also specify a maximum number of requests per time unit and potentially define an algorithm for how the count is decremented, rather than simply resetting the counter at the start of each unit (more on this later). When a request arrives at the API gateway, it increments the appropriate request count and checks whether this increase would exceed the maximum allowable requests per time unit. If so, the request is rejected, most commonly by returning an HTTP 429 "Too Many Requests" status code to the calling client.
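To make this concrete, here is a minimal sketch (in Java, the language we'll use in the follow-up article) of the fixed-window counting approach described above. The class and method names are illustrative rather than taken from any particular gateway:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal fixed-window rate limiter keyed on an arbitrary request property
// (e.g. the clientId). Illustrative only; a real gateway would also need
// eviction of stale keys and a shared store when running multiple instances.
public class FixedWindowRateLimiter {

    private final int maxRequestsPerWindow;
    private final long windowSizeMillis;
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    public FixedWindowRateLimiter(int maxRequestsPerWindow, long windowSizeMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowSizeMillis = windowSizeMillis;
    }

    // Returns true if the request is allowed, false if the caller should
    // respond with HTTP 429 "Too Many Requests".
    public boolean allowRequest(String clientId) {
        long currentWindow = System.currentTimeMillis() / windowSizeMillis;
        Window window = windows.compute(clientId, (id, existing) ->
                (existing == null || existing.id != currentWindow)
                        ? new Window(currentWindow)   // new time unit: reset the counter
                        : existing);
        return window.count.incrementAndGet() <= maxRequestsPerWindow;
    }

    private static final class Window {
        final long id;
        final AtomicInteger count = new AtomicInteger(0);
        Window(long id) { this.id = id; }
    }
}
```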

Closely related to rate limiting is "load shedding." The primary difference here is that the decision to reject traffic is not based on a property of an individual request (e.g., the clientId) but on the overall state of the application (e.g., the database being under heavy load). Implementing the ability to shed load at the point of ingress can prevent a major customer incident if the system is still partially up and running but needs time to recover (or be fixed).
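A load shedding decision can be expressed in a few lines; the health signal and priority property below are assumptions made purely for illustration:

```java
// Illustrative load-shedding check at the point of ingress. The health
// signal and the "priority" request property are assumptions for this
// sketch: low-priority requests are dropped while the backend reports
// that it is overloaded.
public class LoadShedder {

    private volatile boolean backendOverloaded = false; // updated by a health probe

    public void onHealthSignal(boolean overloaded) {
        this.backendOverloaded = overloaded;
    }

    // Returns true if the request should be rejected (e.g. with HTTP 503).
    public boolean shouldShed(String requestPriority) {
        return backendOverloaded && !"high".equals(requestPriority);
    }
}
```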

Challenges with API Gateways

Most open source and commercial API gateways like Edge Stack offer rate limiting, but one of the challenges with many of these implementations is scalability. Running your API gateway on a single compute instance is relatively simple, and this means you can keep the rate limiting counters in memory. For example, if you were rate limiting on clientId, you would simply check and set (increment) the clientId in an in-memory map with an associated integer counter. However, this approach does not scale past the single instance to a cluster of gateway instances.

I’ve seen some developers attempt to get around this limitation by either using sticky sessions or by dividing the total maximum number of allowable requests by the number of rate limiting instances. However, the problem is that neither of these approaches works consistently when deploying and operating applications in a highly dynamic “cloud native” environment, where instances are being destroyed and recreated on-demand and scaled dynamically.

The best solution to overcome this limitation is using a high-performance centralized data store to manage the request count. For example, at Lyft, the team uses Redis (presumably run as a highly-available Redis Sentinel cluster) to track this rate limiting data via their Envoy proxy, which is deployed as a sidecar to all of their services and datastores. There are some potential issues to be aware of with this approach, particularly around the atomicity of the check-and-set operations in Redis. For performance reasons, it is recommended to avoid locking, and both Stripe and Figma have talked about using the Lua scripting functionality (with guaranteed atomicity) within the Redis engine.
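As a rough sketch of what this looks like in practice, the following Java snippet uses the Jedis client to run a small Lua script that increments a per-client counter and sets its expiry atomically inside Redis. The key naming and window size are illustrative, not taken from the Stripe or Figma implementations:

```java
import java.util.Collections;
import redis.clients.jedis.Jedis;

// Sketch of an atomic check-and-increment against Redis using a Lua script,
// in the spirit of the approach described above. Assumes the Jedis client.
public class RedisRateLimiter {

    // INCR the counter and set its expiry on first use; Redis runs the whole
    // script atomically, so no explicit locking is required.
    private static final String LUA_SCRIPT =
            "local current = redis.call('INCR', KEYS[1]) " +
            "if current == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end " +
            "return current";

    private final Jedis jedis;
    private final int maxRequestsPerWindow;
    private final int windowSeconds;

    public RedisRateLimiter(Jedis jedis, int maxRequestsPerWindow, int windowSeconds) {
        this.jedis = jedis;
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowSeconds = windowSeconds;
    }

    // Returns true if the request is allowed, false if it should be rejected.
    public boolean allowRequest(String clientId) {
        String key = "ratelimit:" + clientId;
        Long count = (Long) jedis.eval(LUA_SCRIPT,
                Collections.singletonList(key),
                Collections.singletonList(String.valueOf(windowSeconds)));
        return count <= maxRequestsPerWindow;
    }
}
```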

Other challenges often encountered include extracting request (meta)data to use when determining the rate limit, and specifying (or implementing) the algorithm used to decide whether a specific request should be rejected. Ideally, you want to be able to apply rate limits based on various client properties (e.g., request HTTP method, location, device) and on the decomposition of your backend (e.g., service endpoint, semantic information such as a user-initiated versus an app-initiated request, payload expectations).
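One simple way to support this is to fold the relevant request properties into a single descriptor key that the counter is tracked against. The helper below is a hypothetical illustration of that idea, with example property names:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative helper that combines several request properties into a single
// rate limit "descriptor" key, so different limits can be applied per HTTP
// method, endpoint, client, and so on. The property names are examples only.
public class RateLimitDescriptor {

    public static String keyFor(Map<String, String> properties) {
        // Sort the entries so {method=GET, clientId=abc} and
        // {clientId=abc, method=GET} map to the same counter.
        return new TreeMap<>(properties).entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("|"));
    }
}

// Example:
// keyFor(Map.of("clientId", "abc", "method", "GET", "endpoint", "/orders"))
//   -> "clientId=abc|endpoint=/orders|method=GET"
```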

Rate Limiting via an External Service

An interesting solution that overcomes many of the challenges discussed in the previous section was presented by the Lyft Engineering team last year, when they talked about how the Envoy proxy they use as (what we are now calling) a service mesh implements rate limiting by calling out to an external RateLimit service for each request. The RateLimit service conforms to the Ratelimit protobuf definition, a simple rate limiting API. The Ambassador Labs team has built the open source Ambassador Edge Stack API gateway on top of the Envoy Proxy, and Alex Gervais recently implemented the same rate limiting support for Ambassador.

As you now have access to a protobuf rate limit service API, you can implement a rate limit service in any language you like (or at least any language with protobuf support, which is most modern languages). You also have complete freedom to implement any rate limiting algorithm you like within the service and base the rate limiting decision on any metadata you want to pass to the service. The examples within the Lyft RateLimit service provide some interesting inspiration! As the Ambassador Edge Stack API gateway runs within Kubernetes, any rate limiting service you create can take advantage of Kubernetes to handle scaling and fault tolerance.
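As an example of this freedom, here is a minimal token bucket, one of many algorithms such an external service could implement internally. The parameters are illustrative, and a production service would also need to share state across instances (for example via Redis, as discussed earlier):

```java
// One of many possible algorithms an external rate limit service could use:
// a simple token bucket that allows short bursts up to "capacity" while
// enforcing a steady-state rate of "refillPerSecond". Illustrative only.
public class TokenBucket {

    private final double capacity;          // maximum burst size
    private final double refillPerSecond;   // steady-state request rate
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if the request is allowed; false means "over limit"
    // (e.g. respond with HTTP 429 or an OVER_LIMIT decision).
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;   // consume a token for this request
            return true;
        }
        return false;
    }
}
```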

Wrapping Up with a Look to the Next Article

In this article, you have learned about the motivations for rate limiting and load shedding with an API gateway, and explored some of the challenges of doing this. In the final section, I presented some ideas around integrating rate limiting within an API gateway deployed on a modern cloud native platform (like Kubernetes, ECS, etc.), and discussed how using an external service could give you a lot of flexibility in implementing your exact requirements for a rate limiting algorithm.

In the following article, we will look at implementing a Java rate limiting service for the Ambassador API gateway (here's a sneak peek of some of the code!).

In the meantime, get in touch, or join the Ambassador Labs Community on Slack.

