API Rate Limiting: Algorithms, Headers and Best Practices
Every production API needs a strategy for handling traffic. Without rate limiting, a single misbehaving client or a sudden traffic spike can bring down your entire service. Rate limiting controls how many requests a client can make within a given time window, protecting your infrastructure and ensuring fair usage across all consumers. This guide covers the most common algorithms, the standard HTTP headers, implementation examples and best practices for both API providers and consumers.
What Is Rate Limiting and Why Do APIs Need It?
Rate limiting is the practice of restricting the number of requests a client can send to an API within a defined time period. When a client exceeds the allowed limit, the server responds with a 429 Too Many Requests status code and typically includes headers telling the client when they can retry.
APIs implement rate limiting for several critical reasons:
- Preventing abuse. Without limits, automated scripts or malicious actors can overwhelm your servers with requests.
- Ensuring fair usage. Rate limits prevent a single heavy consumer from degrading performance for everyone else.
- Controlling costs. Each API call consumes compute, memory and bandwidth. Uncontrolled traffic directly increases infrastructure costs.
- Maintaining availability. By throttling excess traffic, you keep the service responsive for legitimate users during peak loads.
- Protecting downstream services. Your API likely depends on databases, caches and third-party services that have their own capacity limits.
Common Rate Limiting Algorithms
There are five widely used algorithms for rate limiting. Each makes different trade-offs between simplicity, memory usage and accuracy. Understanding how they work will help you choose the right one for your use case.
1. Fixed Window Counter
The simplest approach. Divide time into fixed windows (for example, one-minute intervals) and count requests per window. When the count exceeds the limit, reject further requests until the next window starts.
Window: 12:00:00 - 12:00:59 | Limit: 100 requests
─────────────────────────────────────────────────────
Request #1   @ 12:00:02 → ✓ (count: 1)
Request #50  @ 12:00:30 → ✓ (count: 50)
Request #100 @ 12:00:45 → ✓ (count: 100)
Request #101 @ 12:00:50 → ✗ 429 Too Many Requests
─────────────────────────────────────────────────────
Window: 12:01:00 - 12:01:59 | Counter resets to 0
Request #1   @ 12:01:01 → ✓ (count: 1)
Pros: Very simple to implement, minimal memory (one counter per client per window). Cons: Susceptible to boundary spikes. A client can send 100 requests at 12:00:59 and another 100 at 12:01:00, effectively doubling the rate within a two-second span.
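For illustration, a fixed window counter fits in a few lines of Python. The names (`allow_request`, `LIMIT`, `counters`) are ours, and a real deployment would keep the counters in Redis rather than process memory:

```python
import time

LIMIT = 100
WINDOW_SECONDS = 60
counters = {}  # (client_key, window_index) -> request count

def allow_request(client_key, now=None):
    """Return True if the request fits in the current fixed window."""
    now = time.time() if now is None else now
    # All timestamps in the same minute map to the same window index
    window = int(now // WINDOW_SECONDS)
    count = counters.get((client_key, window), 0)
    if count >= LIMIT:
        return False  # an HTTP server would respond 429 here
    counters[(client_key, window)] = count + 1
    return True
```

Note how the counter simply resets when `now // WINDOW_SECONDS` rolls over, which is exactly what makes the boundary-spike problem possible.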
2. Sliding Window Log
Instead of fixed windows, keep a timestamped log of every request. For each new request, remove entries older than the window size and check if the remaining count exceeds the limit.
Sliding window: last 60 seconds | Limit: 100 requests
Request log (sorted by time):
[12:00:02, 12:00:05, 12:00:11, ... 12:00:58]

New request @ 12:01:03:
1. Remove entries before 12:00:03
2. Count remaining entries
3. If count < 100 → allow and add timestamp
4. If count >= 100 → reject with 429
Pros: Very accurate, no boundary spike problem. Cons: High memory usage since you store a timestamp for every request. Not practical for high-throughput APIs.
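The four steps above can be sketched directly in Python (illustrative names again, with the same in-memory caveat):

```python
import time
from collections import deque

LIMIT = 100
WINDOW_SECONDS = 60
logs = {}  # client_key -> deque of request timestamps

def allow_request(client_key, now=None):
    """Keep one timestamp per request; allow if fewer than LIMIT remain in the window."""
    now = time.time() if now is None else now
    log = logs.setdefault(client_key, deque())
    # Steps 1-2: drop entries older than the window, leaving the live count
    while log and log[0] <= now - WINDOW_SECONDS:
        log.popleft()
    # Steps 3-4: decide based on the remaining count
    if len(log) >= LIMIT:
        return False  # reject with 429
    log.append(now)
    return True
```

The memory cost is visible here: the deque holds up to `LIMIT` timestamps per client, which is why this approach struggles at high throughput.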
3. Sliding Window Counter
A hybrid that combines fixed window efficiency with sliding window accuracy. Keep counters for the current and previous window, then calculate a weighted count based on how far into the current window you are.
Window size: 60 seconds | Limit: 100 requests
Previous window (12:00 - 12:01): 84 requests
Current window (12:01 - 12:02): 36 requests
Current time: 12:01:15 (25% into current window)
Weighted count = (prev * overlap%) + current
= (84 * 0.75) + 36
= 63 + 36
             = 99 → ✓ allow (under 100)

Pros: Low memory (two counters per client), smooths out boundary spikes. Cons: Slightly less precise than the sliding log, but the trade-off is usually worth it. This is the most popular algorithm in production systems.
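The weighted calculation translates directly into a short Python sketch (the names and the `state` dict are illustrative; production systems would keep the two counters per client in Redis):

```python
import time

LIMIT = 100
WINDOW = 60  # seconds
# client_key -> {"window": index, "curr": count, "prev": count}
state = {}

def allow_request(client_key, now=None):
    now = time.time() if now is None else now
    window = int(now // WINDOW)
    s = state.setdefault(client_key, {"window": window, "curr": 0, "prev": 0})
    if window != s["window"]:
        # Roll forward: the old current window becomes "previous";
        # anything older than one window counts as zero
        s["prev"] = s["curr"] if window == s["window"] + 1 else 0
        s["curr"] = 0
        s["window"] = window
    elapsed = (now % WINDOW) / WINDOW            # e.g. 0.25 at 12:01:15
    weighted = s["prev"] * (1 - elapsed) + s["curr"]
    if weighted >= LIMIT:
        return False
    s["curr"] += 1
    return True
```

With the article's numbers (84 previous, 36 current, 25% into the window) the weighted count is 84 × 0.75 + 36 = 99, so the request is allowed.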
4. Token Bucket
Imagine a bucket that holds tokens. Tokens are added at a steady rate (the refill rate). Each request removes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which allows short bursts above the average rate.
Bucket capacity: 10 tokens | Refill rate: 1 token/sec

Time   Tokens  Action
─────  ──────  ──────────────────────────────────────
0s     10      Burst: 5 requests → 5 tokens remain
1s     6       +1 refill, 1 request → 5 remain
2s     6       +1 refill, no request
3s     7       +1 refill, no request
4s     8       +1 refill, 1 request → 7 remain
...
10s    10      Bucket full (capped at capacity)

Burst scenario:
0s     10      10 requests at once → 0 tokens
0.1s   0       Request → ✗ rejected (no tokens)
1s     1       +1 refill → 1 request allowed
Pros: Allows controlled bursts, very memory efficient (two numbers: current tokens and last refill time), easy to tune. Cons: Requires careful tuning of bucket size versus refill rate. Used by AWS, Stripe and many other large-scale APIs.
5. Leaky Bucket
Similar to token bucket but focused on smoothing output rather than allowing bursts. Requests enter a queue (the bucket). They are processed at a fixed rate (the leak rate). If the queue is full, new requests are dropped.
Queue capacity: 5 | Processing rate: 1 request/sec

Incoming: ████████ (8 requests arrive at once)
Queue:    [1] [2] [3] [4] [5]   ← 5 queued
Dropped:  [6] [7] [8]           ← 3 rejected (queue full)

Processing: one request leaves the queue every second
t=0s → process [1], queue: [2][3][4][5]
t=1s → process [2], queue: [3][4][5]
t=2s → process [3], queue: [4][5]
...
Pros: Produces a perfectly smooth, constant output rate. Cons: No burst tolerance. Adds latency because requests wait in the queue. Best suited for scenarios where you need a steady processing rate, like network traffic shaping.
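As a rough sketch, a leaky bucket reduces to a bounded queue drained at a fixed rate. The class below is illustrative and synchronous (it drains lazily when a new request arrives); a real implementation would process queued requests on a timer:

```python
from collections import deque

CAPACITY = 5      # maximum queued requests
LEAK_RATE = 1.0   # requests processed per second

class LeakyBucket:
    def __init__(self):
        self.queue = deque()
        self.last_leak = 0.0

    def offer(self, request, now):
        """Queue a request, or drop it if the bucket is full."""
        # Drain requests that would have been processed since the last check
        leaked = int((now - self.last_leak) * LEAK_RATE)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) >= CAPACITY:
            return False  # queue full → request dropped
        self.queue.append(request)
        return True
```

Replaying the diagram: eight simultaneous requests fill the five queue slots and the last three are dropped; one second later a slot has leaked free again.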
Algorithm Comparison at a Glance
Algorithm               Memory    Burst  Accuracy   Complexity
──────────────────────  ────────  ─────  ─────────  ──────────
Fixed Window Counter    Very Low  No*    Low        Very Low
Sliding Window Log      High      No     Very High  Medium
Sliding Window Counter  Low       No     High       Low
Token Bucket            Very Low  Yes    High       Low
Leaky Bucket            Low       No     High       Medium

* Fixed window allows "accidental" bursts at window boundaries
Standard Rate Limit HTTP Headers
While there is no single universal standard, most APIs use a common set of headers to communicate rate limit status to clients. The IETF has published RFC 6585, which defines the 429 status code, and the httpapi working group's newer RateLimit header fields draft (draft-ietf-httpapi-ratelimit-headers). Here are the headers you will encounter most often:
Header Description
───────────────────────── ──────────────────────────────────────
X-RateLimit-Limit Maximum requests allowed per window
X-RateLimit-Remaining Requests remaining in current window
X-RateLimit-Reset Unix timestamp when the window resets
Retry-After Seconds (or date) to wait before retry
                          (included with 429 responses)

A typical successful response includes these headers alongside the normal response body:
HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1711584000
{ "data": { ... } }

Example 429 Rate Limit Response
When a client exceeds the rate limit, the server should return a clear, structured error response. Here is what a well-formed 429 response looks like:
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711584000
{
"error": {
"type": "rate_limit_exceeded",
"message": "Rate limit exceeded. You have made too many requests.",
"retry_after": 30,
"limit": 1000,
"reset_at": "2026-03-28T12:00:00Z"
}
}

Good error responses include both the Retry-After header and the retry information in the JSON body. This makes it easy for clients to handle the error programmatically regardless of how they parse the response.
Implementing Rate Limiting in Node.js / Express
Here is a simple in-memory rate limiter using the sliding window log approach (it stores a timestamp per request) in Express. For production use, replace the in-memory store with Redis so state is shared across multiple server instances.
const express = require("express");
const app = express();

// In-memory store (use Redis in production)
const clients = new Map();

function rateLimit(limit, windowMs) {
  return (req, res, next) => {
    const key = req.ip;
    const now = Date.now();
    const windowStart = now - windowMs;

    if (!clients.has(key)) {
      clients.set(key, []);
    }

    // Remove expired timestamps
    const timestamps = clients
      .get(key)
      .filter((t) => t > windowStart);
    clients.set(key, timestamps);

    if (timestamps.length >= limit) {
      const resetTime = timestamps[0] + windowMs;
      const retryAfter = Math.ceil((resetTime - now) / 1000);
      res.set({
        "X-RateLimit-Limit": limit,
        "X-RateLimit-Remaining": 0,
        "X-RateLimit-Reset": Math.ceil(resetTime / 1000),
        "Retry-After": retryAfter,
      });
      return res.status(429).json({
        error: {
          type: "rate_limit_exceeded",
          message: "Too many requests. Please try again later.",
          retry_after: retryAfter,
        },
      });
    }

    timestamps.push(now);
    res.set({
      "X-RateLimit-Limit": limit,
      "X-RateLimit-Remaining": limit - timestamps.length,
      "X-RateLimit-Reset": Math.ceil((now + windowMs) / 1000),
    });
    next();
  };
}

// Apply: 100 requests per 15-minute window
app.use(rateLimit(100, 15 * 60 * 1000));

app.get("/api/data", (req, res) => {
  res.json({ message: "Success" });
});

app.listen(3000);

For production environments, consider using the express-rate-limit package with a Redis store. It handles edge cases like proxy headers, distributed state and cleanup automatically.
Implementing Rate Limiting in Python / Flask
Here is a token bucket implementation as a Flask decorator. This approach is clean, reusable and easy to understand.
import time
from functools import wraps
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory store (use Redis in production)
buckets = {}

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def rate_limit(capacity=100, refill_rate=1.0):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            key = request.remote_addr
            if key not in buckets:
                buckets[key] = TokenBucket(capacity, refill_rate)
            bucket = buckets[key]
            if not bucket.consume():
                # Round up so we never tell clients to retry in 0 seconds
                retry_after = max(1, round(1 / bucket.refill_rate))
                return jsonify({
                    "error": {
                        "type": "rate_limit_exceeded",
                        "message": "Too many requests.",
                        "retry_after": retry_after
                    }
                }), 429, {
                    "Retry-After": str(retry_after),
                    "X-RateLimit-Limit": str(capacity),
                    "X-RateLimit-Remaining": "0"
                }
            return f(*args, **kwargs)
        return wrapper
    return decorator

@app.route("/api/data")
@rate_limit(capacity=100, refill_rate=1.67)
def get_data():
    return jsonify({"message": "Success"})

if __name__ == "__main__":
    app.run()

Handling Rate Limits on the Client Side
When consuming an API, your code should gracefully handle 429 responses. The standard approach is exponential backoff with jitter, which progressively increases wait times while adding randomness to prevent thundering herd problems.
async function fetchWithRetry(url, options = {}) {
  const maxRetries = 5;
  let attempt = 0;

  while (attempt < maxRetries) {
    const response = await fetch(url, options);
    if (response.status !== 429) {
      return response;
    }
    attempt++;

    // Use Retry-After header if available
    const retryAfter = response.headers.get("Retry-After");
    const retrySeconds = parseInt(retryAfter ?? "", 10);
    let waitMs;
    if (!Number.isNaN(retrySeconds)) {
      waitMs = retrySeconds * 1000;
    } else {
      // Exponential backoff with jitter (also the fallback when
      // Retry-After is an HTTP date rather than a seconds value)
      const base = Math.pow(2, attempt) * 1000;
      const jitter = Math.random() * 1000;
      waitMs = base + jitter;
    }

    console.log(
      `Rate limited. Retrying in ${waitMs}ms ` +
      `(attempt ${attempt}/${maxRetries})`
    );
    await new Promise((r) => setTimeout(r, waitMs));
  }

  throw new Error(
    "Max retries exceeded. API rate limit persists."
  );
}

Key principles for client-side rate limit handling:
- Always respect Retry-After. If the server tells you when to retry, use that value instead of your own backoff calculation.
- Add jitter to backoff. Without jitter, all rate-limited clients will retry at the same time, causing another spike.
- Set a maximum retry count. Do not retry indefinitely. After a reasonable number of attempts, surface the error to the user or log it.
- Track remaining quota. Read X-RateLimit-Remaining headers proactively to slow down before you hit the limit.
- Queue requests. For batch operations, use a request queue that respects the rate limit rather than firing all requests in parallel.
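The queuing point can be sketched with a simple pacing helper (Python here for brevity; `throttled` and its `send` parameter are our names, where `send` is any callable you supply to perform one request):

```python
import time

def throttled(requests, send, max_per_second):
    """Send each request via `send`, never exceeding max_per_second."""
    interval = 1.0 / max_per_second
    results = []
    next_slot = time.monotonic()
    for req in requests:
        # Wait for the next slot instead of firing everything in parallel
        wait = next_slot - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        results.append(send(req))
        next_slot = max(next_slot + interval, time.monotonic())
    return results
```

A fancier version would also watch X-RateLimit-Remaining on each response and stretch the interval as the quota runs low.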
Best Practices for API Providers
How you implement and communicate rate limits has a major impact on developer experience. Here are the practices that distinguish well-designed APIs:
- Document your limits clearly. State the exact limits, the window size, what counts as a request and whether limits are per API key, per IP or per endpoint. Put this in a prominent location, not buried in footnotes.
- Always include rate limit headers. Return X-RateLimit-Limit, X-RateLimit-Remaining and X-RateLimit-Reset on every response, not just 429 responses.
- Include Retry-After on 429 responses. This is the single most useful header for client-side retry logic.
- Offer generous limits for development. Developers building integrations should not run into rate limits during testing. Consider separate, higher limits for sandbox/test environments.
- Use consistent headers across endpoints. Do not use different header names or formats on different endpoints. Consistency reduces integration friction.
- Provide a rate limit status endpoint. A dedicated endpoint (like GET /rate-limit) that returns current usage without counting toward the limit helps developers debug issues.
- Return structured error bodies. Include the error type, a human-readable message, the retry time and the limit details in the JSON response body.
- Consider tiered limits. Different plans or authentication levels can have different limits. Free tier might get 100 requests/hour while paid plans get 10,000.
Rate Limiting in Popular APIs
Looking at how major APIs handle rate limiting provides useful patterns for your own implementation.
GitHub API
GitHub uses a fixed window approach. Authenticated requests get 5,000 requests per hour. Unauthenticated requests are limited to 60 per hour, scoped by IP address. GitHub returns rate limit info on every response and provides a dedicated GET /rate_limit endpoint.
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Reset: 1711584000
X-RateLimit-Used: 13
X-RateLimit-Resource: core
Stripe API
Stripe uses a token bucket algorithm, allowing 100 requests per second in live mode and 25 per second in test mode. They differentiate between read and write operations. Stripe is notable for returning a Stripe-Should-Retry header that explicitly tells the client whether retrying will help.
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Stripe-Should-Retry: true
{
"error": {
"type": "rate_limit_error",
"message": "Too many requests hitting the API ..."
}
}

Twitter (X) API
Twitter uses per-endpoint rate limits with 15-minute windows. Each endpoint has its own limit (for example, 900 requests per 15 minutes for tweet lookups). Limits vary significantly across endpoints and access tiers (Free, Basic, Pro, Enterprise).
x-rate-limit-limit: 900
x-rate-limit-remaining: 823
x-rate-limit-reset: 1711584000
Key Takeaways
- Rate limiting is essential for API reliability, fairness and cost control.
- The token bucket and sliding window counter algorithms are the most popular choices for production APIs.
- Always return standard rate limit headers on every response, and include Retry-After on 429 responses.
- On the client side, implement exponential backoff with jitter and always respect the Retry-After header.
- Document your rate limits clearly and provide generous development-tier limits to reduce integration friction.
- When debugging API issues, inspect the response headers and body carefully. A JSON formatter can help you quickly parse error responses and identify rate limit details.