The purpose of a quota service is to prevent cascading failures in micro-service environments. The service acts as a traffic cop, slowing down traffic where necessary to prevent overloading services. Further, it can be used as a global rate limiter. For this to work, remote procedure calls (RPCs) between services consult the quota service before making a call. The service isn’t strictly for RPCs between services, and can even be used to apply quotas to database calls, for example.
The quota service is of interest to anyone building services that communicate with each other via RPC, or that communicate with shared resources such as databases over a network, and who wants to limit the impact of cascading failures due to resource starvation or poor allocation.
Whether the shared resource is a database or a service in itself, less critical systems could overwhelm a shared resource which could trigger outages in more critical functions that depend on this shared resource.
We have seen this happen in practice in the past, and have responded by attaching custom rate limiter code to various parts of various services, and/or simply increasing capacity without visibility or measurement.
The quota service discussed here aims to ration how much a shared resource can be used by a given service, limiting the impact of such cascading failures. Increasing quotas would involve a capacity planning exercise to ensure the resource is capable of handling the allowed load.
The quota service is written as a gRPC service, implemented in Go for close-to-system runtime performance. The abundance of relevant open source Go libraries, such as leaky bucket and token bucket implementations, also helped influence this decision. Using gRPC also allows for easy client implementations, since gRPC has client libraries for a wide variety of platforms.
Naïve clients simply ask the quota service for permission to make an RPC call. Clients have sensible defaults configured, to fall back to if the quota service is unavailable or doesn't respond in a timely manner. An example of such a fallback could be a naïve rate limiter set to, say, 100 requests per second.
This design also discusses smarter clients. Smart clients maintain a client-side bucket which will be used by the application. A client-side thread will top up the bucket at the default rate. An asynchronous thread will periodically ask the quota service for more tokens to add to the bucket. This allows for greater resilience to latency spikes in the quota service, and takes the additional step of querying the quota service off the critical path. However this comes at the additional cost of a more complex client.
A gRPC client interceptor can be used to make sure quotas are checked before RPC calls are made, as demonstrated in this example. Discussions are currently underway to ensure the same level of client-side interceptor support is available across all gRPC client libraries.
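As a rough sketch of what such an interceptor could look like in Go, the following wraps outgoing unary calls with a quota check. The quotaClient interface, its Allow method and the bucket-naming choice are assumptions for illustration, not the actual client library API.

import (
	"context"

	"google.golang.org/grpc"
)

// quotaClient is a hypothetical handle to the quota service. Allow blocks for up to the
// bucket's wait time and returns an error if the quota is exhausted, or if the quota
// service is unreachable and the local fallback rate limiter also rejects the request.
type quotaClient interface {
	Allow(ctx context.Context, bucket string, tokens int) error
}

// QuotaUnaryInterceptor checks quota before every outgoing unary RPC.
func QuotaUnaryInterceptor(qc quotaClient, bucket string) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if err := qc.Allow(ctx, bucket, 1); err != nil {
			return err // quota denied: do not make the call
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// Usage: conn, err := grpc.Dial(addr, grpc.WithUnaryInterceptor(QuotaUnaryInterceptor(qc, "Pinky_TheBrain")))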
The quota service and clients are all completely open source, under the Apache License v2.0 (ASLv2). See LICENSE for details.
In order of priority:
A general purpose rate limiter should restrict the rate at which an application makes use of a shared resource such as a database or service. Different usage patterns exist, including:
- Simple rate limiting, where all usage is treated equally. The calling application acquires a single token for a named service and, if the token is granted, proceeds to use the service.
- Weighted rate limiting, where all usage is not considered to be equal. The calling application acquires a number of tokens for a named service, based on how expensive the call is considered to be; for example, a SQL query with a LIMIT clause restricting the number of records retrieved.
- Dynamically named shared resources. Same as above, except that quotas for resources are created and destroyed on the fly, based on a template.
- After-the-fact quota consumption updates. Same as simple rate limiting, except that a single token is modeled around a unit of cost, for example a millisecond of processing time. A calling application would acquire a single token to perform a task, and after performing the task, would update the quota service with the actual number of "tokens" used. After-the-fact updates could be batched to reduce network calls.
Some example use cases, using two hypothetical services, Pinky and TheBrain:

- Quotas set up for Pinky to call into TheBrain in the Quota Service.
- Pinky asks the Quota Service for a token to call into TheBrain.
- The Quota Service responds with OK.
- Pinky makes the call into TheBrain.

- Quotas set up for Pinky to call into TheBrain in the Quota Service, but no tokens immediately available.
- Pinky asks the Quota Service for a token to call into TheBrain.
- The Quota Service responds with OK_WAIT and the number of millis to wait, indicating that the client has to wait before proceeding.
- Pinky waits, then makes the call into TheBrain.

- Quotas not set up for Pinky to call into TheBrain in the Quota Service.
- Pinky asks the Quota Service for a token to call into TheBrain (namespace Pinky_TheBrain).
- Pinky_TheBrain does not have a default bucket, nor does it allow dynamic buckets.
- The Quota Service responds with REJECT.
- Pinky uses client-side defaults to call into TheBrain and logs/alerts accordingly.

- Quotas set up for Pinky to call into TheBrain in the Quota Service, but no tokens available for a long while.
- Pinky asks the Quota Service for a token to call into TheBrain.
- Tokens will not become available within maxWaitTime. maxWaitTime is configured per bucket, and can be overridden per request.
- The Quota Service responds with REJECTED.
- Pinky does not make the call into TheBrain.

- Quotas set up for Pinky to call into TheBrain in the Quota Service, pre-fetching tokens.
- Pinky asks the Quota Service for 10 tokens to call into TheBrain.
- The Quota Service responds with OK.
- Pinky uses the tokens to make calls into TheBrain, using 1 token each time.
- When Pinky runs out of tokens, it asks the Quota Service for more tokens.

- Quotas set up for Pinky to call into TheBrain in the Quota Service, post-accounting.
- Pinky asks the Quota Service for a token to call into TheBrain.
- The Quota Service responds with OK.
- Pinky uses the token to make a call into TheBrain, measuring how long the call takes.
- Pinky asks the Quota Service for time_taken_in_millis tokens, as post-accounting for the call just completed.

For each service-to-service call, the quota service maintains a token bucket. Tokens are added to the token bucket at a fixed rate (see detailed algorithm below). Handling an AllowRequest involves locating the appropriate bucket and claiming the requested number of tokens from it. Maximum wait times can be overridden per AllowRequest; however, overridden maximum wait times cannot exceed the configured maximum for the bucket.

Token buckets have a fixed size. If a token bucket is full, no additional tokens are added.
A token bucket has a name and a namespace to which it belongs. Namespaces have defaults that can be applied to named buckets. Namespaces can also be configured to allow dynamically created buckets from a template. Names and namespaces are case-sensitive. Valid characters for names and namespaces are those that match this regexp: [a-zA-Z0-9_]+.
- A token bucket namespace for requests from Pinky to TheBrain, for all services: Pinky_TheBrain
- A token bucket for requests from Pinky to TheBrain, for all endpoints on the UserService: Pinky_TheBrain:UserService
- A token bucket for requests from Pinky to TheBrain, for UserService.getUser(): Pinky_TheBrain:UserService_getUser
- A token bucket namespace for Pinky making calls to its MySQL database: Pinky_PinkyMySQL
- A token bucket for Pinky making calls to the users table in its MySQL database: Pinky_PinkyMySQL:users
- A token bucket for Pinky making calls to the users table in its MySQL database, for inserts: Pinky_PinkyMySQL:users_insert
- A token bucket namespace for all user login requests in TheBrain: TheBrain_userLogins
- Dynamic buckets configured for this namespace, restricting user logins to at most 1/sec. Requests would look like: TheBrain_userLogins:${userId}
Token buckets are stored in a map, allowing for constant-time lookups. This map is keyed on bucket name (as described above), pointing to an instance of a token bucket. Token buckets are created and added to the map lazily.
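For illustration, a hypothetical helper (not part of the service's actual code) that validates names against the regexp above and builds the map key:

import (
	"fmt"
	"regexp"
)

var validName = regexp.MustCompile(`^[a-zA-Z0-9_]+$`)

// bucketKey builds the map key for a bucket, e.g. "Pinky_TheBrain:UserService_getUser".
func bucketKey(namespace, name string) (string, error) {
	if !validName.MatchString(namespace) || !validName.MatchString(name) {
		return "", fmt.Errorf("invalid namespace or bucket name: %s:%s", namespace, name)
	}
	return namespace + ":" + name, nil
}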
The preference is to use a named bucket if possible. For example, given an AllowRequest for Pinky_TheBrain:UserService_getUser, the quota service would look for token buckets in the following order:

1. The named bucket Pinky_TheBrain:UserService_getUser.
2. Create a dynamic bucket in the Pinky_TheBrain namespace, if allowed.
3. Use a default bucket in the Pinky_TheBrain namespace, if allowed.
4. Use a global default bucket, if allowed.
If a bucket doesn’t exist but the namespace is configured to allow dynamic buckets, a named bucket is created using defaults from a template defined on the namespace. A namespace configured to allow dynamic buckets is also configured with a limit on the number of dynamic buckets it may create.
Buckets may be deleted to reclaim memory. A bucket can have a maximum idle time defined, after which it is removed. Accesses to buckets are recorded. If a bucket is removed and subsequently accessed, it is created anew.
If a bucket isn't found and dynamic buckets are not enabled for a namespace, behavior depends on whether a default bucket is configured on the namespace. If one is configured, it is used. If not, a global default bucket is attempted. If a global default bucket doesn’t exist, the call fails.
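A rough sketch of this lookup order follows. The types and field names are assumptions for illustration, not the service's actual implementation.

type bucket struct{ /* token bucket state */ }

type namespace struct {
	buckets             map[string]*bucket
	defaultBucket       *bucket // nil if no namespace default is configured
	allowDynamicBuckets bool
	maxDynamicBuckets   int // 0 assumed to mean unlimited
}

type bucketContainer struct {
	namespaces          map[string]*namespace
	globalDefaultBucket *bucket // nil if no global default is configured
}

func (c *bucketContainer) find(nsName, bucketName string) *bucket {
	ns, ok := c.namespaces[nsName]
	if !ok {
		return c.globalDefaultBucket // may be nil, in which case the request fails
	}
	if b, ok := ns.buckets[bucketName]; ok {
		return b // 1. the named bucket
	}
	if ns.allowDynamicBuckets &&
		(ns.maxDynamicBuckets == 0 || len(ns.buckets) < ns.maxDynamicBuckets) {
		b := &bucket{} // 2. a dynamic bucket, created lazily from the namespace template
		ns.buckets[bucketName] = b
		return b
	}
	if ns.defaultBucket != nil {
		return ns.defaultBucket // 3. the namespace's default bucket
	}
	return c.globalDefaultBucket // 4. the global default bucket, if configured
}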
Buckets are maintained solely in-memory, and are not persisted. If a server fails and is restarted, buckets are recreated as per configuration and will start empty. The replenishing thread also starts immediately, providing each bucket with tokens.
Persisting buckets to disk may be considered in the future, but for now it is out of scope.
Configurations for each bucket are stored in memory, alongside each bucket, after being read from a configuration YAML file. Once YAML file support for configurations is removed, configurations will be managed via a web-based admin console and persisted to a durable back-end, with adapters for storing them on disk or in destinations such as MySQL, Zookeeper or etcd, for greater durability.
Tokens are added to a bucket lazily, when tokens are requested and sufficient time has passed to allow additional permits to be added, taking inspiration from Guava's RateLimiter.

The algorithm below lazily accumulates tokens, up to the size of the bucket, based on the time elapsed since tokens last became available. If a caller requests more tokens than are available, the tokens are granted, but the next caller will have to wait for more tokens to become available. This is achieved by advancing tokensNextAvailableNanos into the future, so that subsequent callers pay back the "token debt" of the previous caller.

Note that a maxDebtNanos configuration parameter is maintained per bucket, to ensure requests don't attempt to acquire a large number of tokens, thereby locking up the quota service for a long period of time.
var tokensNextAvailableNanos // Stored per bucket
var accumulatedTokens        // Stored per bucket
var bucketSize               // from config
var fillRate                 // from config, as tokens/sec
var nanosBetweenTokens = 1E+9 / fillRate
var currentTimeNanos         // current time, read once at the start of the request
var freshTokens = 0

// Lazily accumulate any tokens that have become available since the last request.
if currentTimeNanos > tokensNextAvailableNanos {
  freshTokens = (currentTimeNanos - tokensNextAvailableNanos) / nanosBetweenTokens
  accumulatedTokens = min(bucketSize, accumulatedTokens + freshTokens)
  tokensNextAvailableNanos = currentTimeNanos
}

waitTime = tokensNextAvailableNanos - currentTimeNanos
accumulatedTokensUsed = min(accumulatedTokens, requested)
tokensToWaitFor = requested - accumulatedTokensUsed
futureWaitNanos = tokensToWaitFor * nanosBetweenTokens

// Reject if granting this request would push the bucket too far into debt.
if tokensNextAvailableNanos + futureWaitNanos - currentTimeNanos > maxDebtNanos {
  // Return rejection response
} else {
  tokensNextAvailableNanos = tokensNextAvailableNanos + futureWaitNanos
  accumulatedTokens = accumulatedTokens - accumulatedTokensUsed
  // waitTime contains how long the caller has to wait before using the tokens requested.
  // Return success response with waitTime
}
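As a worked example, with assumed numbers: if fillRate is 100 tokens/sec, nanosBetweenTokens is 10ms. With an empty bucket and tokensNextAvailableNanos equal to the current time, a request for 5 tokens is granted with waitTime = 0, but tokensNextAvailableNanos advances 50ms into the future. A second request arriving immediately afterwards must therefore wait roughly 50ms for its tokens, or is rejected if its total wait would exceed maxDebtNanos.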
A protobuf service endpoint will be exposed by the quota service, as defined here.
While we’re designing for a gRPC-based API, it is conceivable that other RPC mechanisms may also be desired, such as Thrift or even simple JSON-over-HTTP. To this end, the quota service is designed to plug into any request/response style RPC mechanism, by providing an interface as an extension point that must be implemented to support additional RPC mechanisms.
type RpcEndpoint interface {
Init(cfgs *configs.Configs, qs service.QuotaService)
Start()
Stop()
}
The QuotaService interface passed in encapsulates the token buckets and provides the basic functionality of acquiring quotas.
type QuotaService interface {
Allow(bucketName string, tokensRequested int, emptyBucketPolicyOverride EmptyBucketPolicyOverride) (int, error)
}
The built-in gRPC implementation of the RpcEndpoint interface, for example, simply adapts the protobuf service implementation to call in to QuotaService.Allow, transforming parameters accordingly.
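As an illustration only, a JSON-over-HTTP endpoint might implement the interface roughly as follows. The route, response field and the use of 0 as a "no override" EmptyBucketPolicyOverride are assumptions; imports of the project's own configs and service packages are omitted.

import (
	"fmt"
	"net/http"
	"strconv"
)

type httpEndpoint struct {
	qs     service.QuotaService
	server *http.Server
}

func (e *httpEndpoint) Init(cfgs *configs.Configs, qs service.QuotaService) {
	e.qs = qs
	mux := http.NewServeMux()
	mux.HandleFunc("/allow", func(w http.ResponseWriter, r *http.Request) {
		bucket := r.URL.Query().Get("bucket")
		tokens, _ := strconv.Atoi(r.URL.Query().Get("tokens"))
		wait, err := e.qs.Allow(bucket, tokens, 0) // 0: assumed to mean "no policy override"
		if err != nil {
			http.Error(w, err.Error(), http.StatusTooManyRequests)
			return
		}
		fmt.Fprintf(w, `{"wait": %d}`, wait)
	})
	e.server = &http.Server{Addr: ":8080", Handler: mux}
}

func (e *httpEndpoint) Start() { go e.server.ListenAndServe() }
func (e *httpEndpoint) Stop()  { e.server.Close() }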
The quota service can be run as a single node, however it will have limited scalability and availability characteristics when run in this manner. As such, it is also designed to run in a cluster, backed by a shared data structure that holds the token buckets. Any node may update the data structure so requests can be load balanced to all quota service nodes.
The shared data structure is treated as ephemeral, and as such, complexities of persistence, replication, disaster recovery are all averted. If the shared data structure’s contents are lost, buckets are lazily rebuilt as needed, as per best-effort guarantees.
The only available shared data structure at the moment is backed by Redis. Redis performs to within expectations (see below on SLOs). Redis is treated as ephemeral, so persisting or adding durability to Redis' state is unnecessary.
Other implementations - including ones based on distributed consensus algorithms - can easily be plugged in.
QuotaService supports sharding using Envoy. It partitions based on the namespace and the bucket name (namespace:bucket).
The quota service makes use of standard Go logging. However this can be overridden to allow for different logging back-ends by passing in a logger implementing Logger:
// Logger mimics golang's standard Logger as an interface.
type Logger interface {
Fatal(args ...interface{})
Fatalf(format string, args ...interface{})
Fatalln(args ...interface{})
Print(args ...interface{})
Printf(format string, args ...interface{})
Println(args ...interface{})
}
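Since this interface mirrors the standard library's log.Logger, a stock logger can be used directly; how the logger is handed to the service is not shown here.

import (
	"log"
	"os"
)

// *log.Logger provides Fatal/Fatalf/Fatalln and Print/Printf/Println,
// so it satisfies the Logger interface declared above.
var _ Logger = log.New(os.Stderr, "quotaservice: ", log.LstdFlags)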
Listeners can be attached, to be notified of events that take place, such as tokens being served, timeouts serving tokens, too many tokens being requested, bucket misses, and buckets being created or removed.
Each event callback passes the caller the following details:
type Event interface {
EventType() EventType
Namespace() string
BucketName() string
Dynamic() bool
NumTokens() int64
WaitTime() time.Duration
}
where EventType is defined as:
type EventType int
const (
EVENT_TOKENS_SERVED EventType = iota
EVENT_TIMEOUT_SERVING_TOKENS
EVENT_TOO_MANY_TOKENS_REQUESTED
EVENT_BUCKET_MISS
EVENT_BUCKET_CREATED
EVENT_BUCKET_REMOVED
)
Metrics can be implemented by attaching an event listener and collecting data from the event.
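For instance, a minimal listener that counts events by type might look like the following; how listeners are registered with the service (here assumed to be a func(Event) callback) is an assumption.

import (
	"sync"
	"time"
)

// eventCounter tallies events by type and accumulates total wait time served.
type eventCounter struct {
	mu        sync.Mutex
	counts    map[EventType]int64
	totalWait time.Duration
}

// Listen is the callback attached to the quota service.
func (c *eventCounter) Listen(e Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[e.EventType()]++
	if e.EventType() == EVENT_TOKENS_SERVED {
		c.totalWait += e.WaitTime() // could also feed a latency histogram
	}
}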
The following configuration elements need to be provided to the quota service:

Global:
- Whether a global default bucket is available, and its settings.

For each namespace:
- The maximum number of dynamic buckets that may be created (defaults to 0, i.e., unlimited).

For each bucket:
- Bucket size (defaults to 100).
- Fill rate, in tokens per second (defaults to 50).
- Maximum wait time per request (defaults to 1000).
- Maximum idle time before an unused bucket is removed (defaults to -1, i.e., never removed).
- Maximum debt (defaults to 10000).
- Maximum tokens that may be requested in a single request (defaults to the fill_rate).

See the GoDocs on configs.ServiceConfig for more details.
A prototype of the quota service, currently deployed in staging and backed by a single Redis node, exhibits the following performance characteristics when served by 2 QuotaService nodes, each restricted to 4 CPU cores.
Completed 1,519,404 iterations, mean 947.051µs each, at 135,040 requests/sec.
Latency at percentiles:
50th percentile: 884.735µs
75th percentile: 983.039µs
90th percentile: 1.114111ms
95th percentile: 1.179647ms
99th percentile: 2.031615ms
99.9th percentile: 6.553599ms
99.99th percentile: 209.715199ms
Test time 10m0.114231921s
While latencies at the 99th percentile and above aren’t ideal, this is just a prototype and will go through profiling and performance tuning before being deployed to production. During this test, host resources on both the quota servers and the shared Redis back-end were under-utilized.
The same prototype was pushed harder, with greater load, yielding the following results:
Completed 5,895,093 iterations, mean 2.235415ms each, at 228,864 requests/sec.
Latency at percentiles:
50th percentile: 1.114111ms
75th percentile: 1.310719ms
90th percentile: 1.966079ms
95th percentile: 3.801087ms
99th percentile: 25.165823ms
99.9th percentile: 142.606335ms
99.99th percentile: 318.767103ms
Test time 10m1.060592248s
While the service is capable of dealing with far higher throughput, we do see latencies spike beyond the 95th percentile. More investigation will be needed, but this level of throughput is well above our targets in any case.
Since the quota service is a best-effort system, failure scenarios are easy to deal with.
When faced with latency spikes or communication failures with the service, clients will fall back to client-side rate-limiters preconfigured with sensible/conservative defaults.
When faced with a loss of token buckets (i.e., loss of Redis state), the service simply creates new token buckets, losing no more than a few seconds of accurate traffic shaping. Depending on how buckets are configured (size, fill rates, fill frequency), the impact could be minimal.
Since clients are generated gRPC service stubs, there is no design beyond the protobuf service definition for this phase. Services actively call into the quota service to obtain permissions before making RPC calls.
These clients are implemented as gRPC client-side interceptors, and are designed to rate-limit gRPC-based service-to-service calls. The calling service just makes an RPC call to a target service, and the interceptor, transparently to the caller, first calls into the quota service, blocking if necessary. Such clients are configured with fallback policies, so that if they are unable to contact the quota service, they fall back to a simple rate limiter with configured limits.
Naive clients are configured with fallback policies: a simple client-side rate limiter with preconfigured, conservative limits, used when the quota service is unreachable or slow to respond.
The quota service clients will also expose metrics describing their behavior.
Smart clients allow for periodic, asynchronous communication with the quota service to request for tokens in batches, and always make use of a client-side token bucket when rate-limiting.
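A rough sketch of such a smart client follows, pending the full design. The tokenFetcher interface and the batching and refill parameters are assumptions for illustration, not the actual client API.

import (
	"time"
)

// tokenFetcher is a hypothetical handle to the quota service, returning the
// number of tokens actually granted for a bucket.
type tokenFetcher interface {
	Allow(bucket string, tokens int) (granted int, err error)
}

// smartClient keeps a local token bucket that the application draws from.
type smartClient struct {
	tokens chan struct{}
}

func newSmartClient(qs tokenFetcher, bucket string, size, defaultRatePerSec, batch int) *smartClient {
	c := &smartClient{tokens: make(chan struct{}, size)}

	// Top up at the default rate, so the client degrades gracefully when the
	// quota service is slow or unavailable.
	go func() {
		for range time.Tick(time.Second / time.Duration(defaultRatePerSec)) {
			select {
			case c.tokens <- struct{}{}:
			default: // bucket full, drop the token
			}
		}
	}()

	// Periodically ask the quota service for a batch of tokens, off the critical path.
	go func() {
		for range time.Tick(time.Second) {
			granted, err := qs.Allow(bucket, batch)
			if err != nil {
				continue // quota service unavailable: rely on the default top-up rate
			}
			for i := 0; i < granted; i++ {
				select {
				case c.tokens <- struct{}{}:
				default:
				}
			}
		}
	}()
	return c
}

// Acquire blocks until a client-side token is available.
func (c *smartClient) Acquire() { <-c.tokens }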
// TODO(manik) Complete this section