API Rate Limiting: Algorithms, Headers and Best Practices

Every production API needs a strategy for handling traffic. Without rate limiting, a single misbehaving client or a sudden traffic spike can bring down your entire service. Rate limiting controls how many requests a client can make within a given time window, protecting your infrastructure and ensuring fair usage across all consumers. This guide covers the most common algorithms, the standard HTTP headers, implementation examples and best practices for both API providers and consumers.

What Is Rate Limiting and Why Do APIs Need It?

Rate limiting is the practice of restricting the number of requests a client can send to an API within a defined time period. When a client exceeds the allowed limit, the server responds with a 429 Too Many Requests status code and typically includes headers telling the client when they can retry.

APIs implement rate limiting for several critical reasons:

  • Preventing abuse. Without limits, automated scripts or malicious actors can overwhelm your servers with requests.
  • Ensuring fair usage. Rate limits prevent a single heavy consumer from degrading performance for everyone else.
  • Controlling costs. Each API call consumes compute, memory and bandwidth. Uncontrolled traffic directly increases infrastructure costs.
  • Maintaining availability. By throttling excess traffic, you keep the service responsive for legitimate users during peak loads.
  • Protecting downstream services. Your API likely depends on databases, caches and third-party services that have their own capacity limits.

Common Rate Limiting Algorithms

There are five widely used algorithms for rate limiting. Each makes different trade-offs between simplicity, memory usage and accuracy. Understanding how they work will help you choose the right one for your use case.

1. Fixed Window Counter

The simplest approach. Divide time into fixed windows (for example, one-minute intervals) and count requests per window. When the count exceeds the limit, reject further requests until the next window starts.

Window: 12:00:00 - 12:00:59  |  Limit: 100 requests
─────────────────────────────────────────────────────
  Request #1   @ 12:00:02  →  ✓  (count: 1)
  Request #50  @ 12:00:30  →  ✓  (count: 50)
  Request #100 @ 12:00:45  →  ✓  (count: 100)
  Request #101 @ 12:00:50  →  ✗  429 Too Many Requests
─────────────────────────────────────────────────────
Window: 12:01:00 - 12:01:59  |  Counter resets to 0
  Request #1   @ 12:01:01  →  ✓  (count: 1)

Pros: Very simple to implement, minimal memory (one counter per client per window). Cons: Susceptible to boundary spikes. A client can send 100 requests at 12:00:59 and another 100 at 12:01:00, effectively doubling the rate within a two-second span.
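The counting logic above fits in a few lines of Python. This is a minimal illustrative sketch (class and method names are ours; a production version would keep the counters in Redis so they are shared across server instances):

```python
import time


class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (client, window_start) -> request count

    def allow(self, client, now=None):
        now = time.time() if now is None else now
        # Identify the current window by its start time
        window_start = int(now // self.window) * self.window
        key = (client, window_start)
        count = self.counts.get(key, 0)
        if count >= self.limit:
            return False  # over the limit -> respond with 429
        self.counts[key] = count + 1
        return True
```

Note that stale `(client, window_start)` entries accumulate in the map; a real implementation would expire them (Redis `EXPIRE` does this for free).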

2. Sliding Window Log

Instead of fixed windows, keep a timestamped log of every request. For each new request, remove entries older than the window size and check if the remaining count exceeds the limit.

Sliding window: last 60 seconds  |  Limit: 100 requests

Request log (sorted by time):
  [12:00:02, 12:00:05, 12:00:11, ... 12:00:58]

New request @ 12:01:03:
  1. Remove entries before 12:00:03
  2. Count remaining entries
  3. If count < 100 → allow and add timestamp
  4. If count >= 100 → reject with 429

Pros: Very accurate, no boundary spike problem. Cons: High memory usage since you store a timestamp for every request. Not practical for high-throughput APIs.
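The prune-then-count steps above can be sketched with a deque of timestamps per client (names are ours; again, an in-memory illustration rather than a production implementation):

```python
import time
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = {}  # client -> deque of request timestamps

    def allow(self, client, now=None):
        now = time.time() if now is None else now
        timestamps = self.log.setdefault(client, deque())
        # Step 1: drop entries older than the window
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        # Steps 2-4: count what remains and decide
        if len(timestamps) >= self.limit:
            return False  # reject with 429
        timestamps.append(now)
        return True
```

The memory cost is visible here: one stored timestamp per allowed request, per client.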

3. Sliding Window Counter

A hybrid that combines fixed window efficiency with sliding window accuracy. Keep counters for the current and previous window, then calculate a weighted count based on how far into the current window you are.

Window size: 60 seconds  |  Limit: 100 requests

Previous window (12:00 - 12:01): 84 requests
Current window  (12:01 - 12:02): 36 requests

Current time: 12:01:15 (25% into current window)

Weighted count = (prev * overlap%) + current
               = (84 * 0.75) + 36
               = 63 + 36
               = 99  →  ✓ allow (under 100)

Pros: Low memory (two counters per client), smooths out boundary spikes. Cons: Slightly less precise than the sliding log, but the trade-off is usually worth it. This is the most popular algorithm in production systems.
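The weighted count is plain arithmetic, so a small helper makes it concrete (the function name is ours):

```python
def weighted_count(prev_count, curr_count, elapsed_fraction):
    """Estimate requests in the sliding window.

    elapsed_fraction: how far into the current window we are (0.0-1.0).
    The previous window is weighted by the portion of it that still
    overlaps the sliding window, i.e. (1 - elapsed_fraction).
    """
    return prev_count * (1 - elapsed_fraction) + curr_count


# The example from the diagram above: 25% into the current window
estimate = weighted_count(prev_count=84, curr_count=36, elapsed_fraction=0.25)
# 84 * 0.75 + 36 = 99.0 -> under the limit of 100, so allow
```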

4. Token Bucket

Imagine a bucket that holds tokens. Tokens are added at a steady rate (the refill rate). Each request removes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which allows short bursts above the average rate.

Bucket capacity: 10 tokens  |  Refill rate: 1 token/sec

Time   Tokens   Action
─────  ──────   ──────────────────────
0s     10       Burst: 5 requests → 5 tokens remain
1s     6        +1 refill, 1 request → 5 remain
2s     6        +1 refill, no request
3s     7        +1 refill, no request
4s     8        +1 refill, 1 request → 7 remain
...
7s     10       Bucket full (capped at capacity)

Burst scenario:
0s     10       10 requests at once → 0 tokens
0.1s   0        Request → ✗ rejected (no tokens)
1s     1        +1 refill → 1 request allowed

Pros: Allows controlled bursts, very memory efficient (two numbers: current tokens and last refill time), easy to tune. Cons: Requires careful tuning of bucket size versus refill rate. Used by AWS, Stripe and many other large-scale APIs.

5. Leaky Bucket

Similar to token bucket but focused on smoothing output rather than allowing bursts. Requests enter a queue (the bucket). They are processed at a fixed rate (the leak rate). If the queue is full, new requests are dropped.

Queue capacity: 5  |  Processing rate: 1 request/sec

Incoming:  ████████  (8 requests arrive at once)

Queue:     [1] [2] [3] [4] [5]   ← 5 queued
Dropped:   [6] [7] [8]           ← 3 rejected (queue full)

Processing: one request leaves queue every second
  t=0s  → process [1], queue: [2][3][4][5]
  t=1s  → process [2], queue: [3][4][5]
  t=2s  → process [3], queue: [4][5]
  ...

Pros: Produces a perfectly smooth, constant output rate. Cons: No burst tolerance. Adds latency because requests wait in the queue. Best suited for scenarios where you need a steady processing rate, like network traffic shaping.
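The queue-and-leak behavior above can be sketched like this (names are ours; a real server would process queued requests asynchronously rather than on the next call):

```python
from collections import deque


class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = 0.0

    def offer(self, request, now):
        # Leak: remove requests processed since the last check
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
        if len(self.queue) >= self.capacity:
            return False  # queue full -> drop the request
        self.queue.append(request)
        return True
```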

Algorithm Comparison at a Glance

Algorithm              Memory     Burst    Accuracy   Complexity
─────────────────────  ─────────  ───────  ─────────  ──────────
Fixed Window Counter   Very Low   No*      Low        Very Low
Sliding Window Log     High       No       Very High  Medium
Sliding Window Counter Low        No       High       Low
Token Bucket           Very Low   Yes      High       Low
Leaky Bucket           Low        No       High       Medium

* Fixed window allows "accidental" bursts at window boundaries

Standard Rate Limit HTTP Headers

While there is no single universal standard, most APIs use a common set of de facto headers to communicate rate limit status to clients. The IETF has published RFC 6585 (which defines the 429 status code), and a newer RateLimit header fields specification is still an IETF draft. Here are the headers you will encounter most often:

Header                     Description
─────────────────────────  ──────────────────────────────────────
X-RateLimit-Limit          Maximum requests allowed per window
X-RateLimit-Remaining      Requests remaining in current window
X-RateLimit-Reset          Unix timestamp when the window resets
Retry-After                Seconds (or date) to wait before retry
                           (included with 429 responses)

A typical successful response includes these headers alongside the normal response body:

HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1711584000

{ "data": { ... } }
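A client can read these headers proactively to track its remaining quota. A small parser, assuming exactly the header names shown above (other APIs may use different names or casing):

```python
def parse_rate_limit_headers(headers):
    """Extract X-RateLimit-* values from a response header map."""
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        "reset": int(headers["X-RateLimit-Reset"]),  # Unix timestamp
    }


# Using the values from the example response above:
info = parse_rate_limit_headers({
    "X-RateLimit-Limit": "1000",
    "X-RateLimit-Remaining": "847",
    "X-RateLimit-Reset": "1711584000",
})
```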

Example 429 Rate Limit Response

When a client exceeds the rate limit, the server should return a clear, structured error response. Here is what a well-formed 429 response looks like:

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711584000

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit exceeded. You have made too many requests.",
    "retry_after": 30,
    "limit": 1000,
    "reset_at": "2026-03-28T12:00:00Z"
  }
}

Good error responses include both the Retry-After header and the retry information in the JSON body. This makes it easy for clients to handle the error programmatically regardless of how they parse the response.

Implementing Rate Limiting in Node.js / Express

Here is a simple in-memory rate limiter using the sliding window log approach in Express (it stores a timestamp per request, as described in algorithm 2 above). For production use, replace the in-memory store with Redis so that state is shared across multiple server instances.

const express = require("express");
const app = express();

// In-memory store (use Redis in production)
const clients = new Map();

function rateLimit(limit, windowMs) {
  return (req, res, next) => {
    const key = req.ip;
    const now = Date.now();
    const windowStart = now - windowMs;

    if (!clients.has(key)) {
      clients.set(key, []);
    }

    // Remove expired timestamps
    const timestamps = clients
      .get(key)
      .filter((t) => t > windowStart);
    clients.set(key, timestamps);

    if (timestamps.length >= limit) {
      const resetTime = timestamps[0] + windowMs;
      const retryAfter = Math.ceil((resetTime - now) / 1000);

      res.set({
        "X-RateLimit-Limit": limit,
        "X-RateLimit-Remaining": 0,
        "X-RateLimit-Reset": Math.ceil(resetTime / 1000),
        "Retry-After": retryAfter,
      });

      return res.status(429).json({
        error: {
          type: "rate_limit_exceeded",
          message: "Too many requests. Please try again later.",
          retry_after: retryAfter,
        },
      });
    }

    timestamps.push(now);

    res.set({
      "X-RateLimit-Limit": limit,
      "X-RateLimit-Remaining": limit - timestamps.length,
      "X-RateLimit-Reset": Math.ceil(
        (now + windowMs) / 1000
      ),
    });

    next();
  };
}

// Apply: 100 requests per 15-minute window
app.use(rateLimit(100, 15 * 60 * 1000));

app.get("/api/data", (req, res) => {
  res.json({ message: "Success" });
});

app.listen(3000);

For production environments, consider using the express-rate-limit package with a Redis store. It handles edge cases like proxy headers, distributed state and cleanup automatically.

Implementing Rate Limiting in Python / Flask

Here is a token bucket implementation as a Flask decorator. This approach is clean, reusable and easy to understand.

import time
from functools import wraps
from flask import Flask, request, jsonify

app = Flask(__name__)

# In-memory store (use Redis in production)
buckets = {}

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def rate_limit(capacity=100, refill_rate=1.0):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            key = request.remote_addr

            if key not in buckets:
                buckets[key] = TokenBucket(
                    capacity, refill_rate
                )

            bucket = buckets[key]

            if not bucket.consume():
                # Never advertise 0 seconds, even for fast refill rates
                retry_after = max(1, int(1 / bucket.refill_rate))
                return jsonify({
                    "error": {
                        "type": "rate_limit_exceeded",
                        "message": "Too many requests.",
                        "retry_after": retry_after
                    }
                }), 429, {
                    "Retry-After": str(retry_after),
                    "X-RateLimit-Limit": str(capacity),
                    "X-RateLimit-Remaining": "0"
                }

            response = f(*args, **kwargs)
            return response
        return wrapper
    return decorator

@app.route("/api/data")
@rate_limit(capacity=100, refill_rate=1.67)
def get_data():
    return jsonify({"message": "Success"})

if __name__ == "__main__":
    app.run()

Handling Rate Limits on the Client Side

When consuming an API, your code should gracefully handle 429 responses. The standard approach is exponential backoff with jitter, which progressively increases wait times while adding randomness to prevent thundering herd problems.

async function fetchWithRetry(url, options = {}) {
  const maxRetries = 5;
  let attempt = 0;

  while (attempt < maxRetries) {
    const response = await fetch(url, options);

    if (response.status !== 429) {
      return response;
    }

    attempt++;

    // Use Retry-After header if available
    const retryAfter = response.headers
      .get("Retry-After");

    let waitMs;
    const retrySec = parseInt(retryAfter, 10);
    if (retryAfter && !Number.isNaN(retrySec)) {
      waitMs = retrySec * 1000;
    } else {
      // Exponential backoff with jitter
      // (also covers a date-valued Retry-After)
      const base = Math.pow(2, attempt) * 1000;
      const jitter = Math.random() * 1000;
      waitMs = base + jitter;
    }

    console.log(
      `Rate limited. Retrying in ${waitMs}ms ` +
      `(attempt ${attempt}/${maxRetries})`
    );

    await new Promise((r) => setTimeout(r, waitMs));
  }

  throw new Error(
    "Max retries exceeded. API rate limit persists."
  );
}

Key principles for client-side rate limit handling:

  • Always respect Retry-After. If the server tells you when to retry, use that value instead of your own backoff calculation.
  • Add jitter to backoff. Without jitter, all rate-limited clients will retry at the same time, causing another spike.
  • Set a maximum retry count. Do not retry indefinitely. After a reasonable number of attempts, surface the error to the user or log it.
  • Track remaining quota. Read X-RateLimit-Remaining headers proactively to slow down before you hit the limit.
  • Queue requests. For batch operations, use a request queue that respects the rate limit rather than firing all requests in parallel.
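The last bullet can be as simple as spacing outgoing calls by a minimum interval. A hedged sketch (the class name and API are ours; callers ask `schedule` how long to sleep before sending the next request):

```python
class RequestSpacer:
    """Client-side pacing: keep outgoing calls under a known rate limit."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self.next_allowed = 0.0

    def schedule(self, now):
        """Return seconds to wait before the next request may be sent."""
        wait = max(0.0, self.next_allowed - now)
        # Reserve the next send slot, one interval after the later of
        # "now" and the previously reserved slot
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return wait
```

This is effectively a client-side token bucket with capacity 1; for batch jobs it guarantees you never exceed the advertised rate, at the cost of forgoing bursts.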

Best Practices for API Providers

How you implement and communicate rate limits has a major impact on developer experience. Here are the practices that distinguish well-designed APIs:

  • Document your limits clearly. State the exact limits, the window size, what counts as a request and whether limits are per API key, per IP or per endpoint. Put this in a prominent location, not buried in footnotes.
  • Always include rate limit headers. Return X-RateLimit-Limit, X-RateLimit-Remaining and X-RateLimit-Reset on every response, not just 429 responses.
  • Include Retry-After on 429 responses. This is the single most useful header for client-side retry logic.
  • Offer generous limits for development. Developers building integrations should not run into rate limits during testing. Consider separate, higher limits for sandbox/test environments.
  • Use consistent headers across endpoints. Do not use different header names or formats on different endpoints. Consistency reduces integration friction.
  • Provide a rate limit status endpoint. A dedicated endpoint (like GET /rate-limit) that returns current usage without counting toward the limit helps developers debug issues.
  • Return structured error bodies. Include the error type, a human-readable message, the retry time and the limit details in the JSON response body.
  • Consider tiered limits. Different plans or authentication levels can have different limits. Free tier might get 100 requests/hour while paid plans get 10,000.

Rate Limiting in Popular APIs

Looking at how major APIs handle rate limiting provides useful patterns for your own implementation.

GitHub API

GitHub uses a fixed window approach. Authenticated requests get 5,000 requests per hour. Unauthenticated requests are limited to 60 per hour, scoped by IP address. GitHub returns rate limit info on every response and provides a dedicated GET /rate_limit endpoint.

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4987
X-RateLimit-Reset: 1711584000
X-RateLimit-Used: 13
X-RateLimit-Resource: core

Stripe API

Stripe uses a token bucket algorithm, allowing 100 requests per second in live mode and 25 per second in test mode. They differentiate between read and write operations. Stripe is notable for returning a Stripe-Should-Retry header that explicitly tells the client whether retrying will help.

HTTP/1.1 429 Too Many Requests
Retry-After: 1
Stripe-Should-Retry: true

{
  "error": {
    "type": "rate_limit_error",
    "message": "Too many requests hitting the API ..."
  }
}
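A client honoring Stripe-Should-Retry might gate its retry decision like this (an illustrative sketch; the header semantics are as described above, and the function name is ours):

```python
def should_retry(status, headers):
    """Decide whether a retry is worthwhile for a rate-limited response."""
    if status != 429:
        return False
    flag = headers.get("Stripe-Should-Retry")
    if flag is not None:
        # The server explicitly said whether retrying will help
        return flag.lower() == "true"
    return True  # no hint: a 429 is generally retryable with backoff
```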

Twitter (X) API

Twitter uses per-endpoint rate limits with 15-minute windows. Each endpoint has its own limit (for example, 900 requests per 15 minutes for tweet lookups). Limits vary significantly across endpoints and access tiers (Free, Basic, Pro, Enterprise).

x-rate-limit-limit: 900
x-rate-limit-remaining: 823
x-rate-limit-reset: 1711584000

Key Takeaways

  • Rate limiting is essential for API reliability, fairness and cost control.
  • The token bucket and sliding window counter algorithms are the most popular choices for production APIs.
  • Always return standard rate limit headers on every response, and include Retry-After on 429 responses.
  • On the client side, implement exponential backoff with jitter and always respect the Retry-After header.
  • Document your rate limits clearly and provide generous development-tier limits to reduce integration friction.
  • When debugging API issues, inspect the response headers and body carefully. A JSON formatter can help you quickly parse error responses and identify rate limit details.

Format and inspect API responses

Debugging rate limit errors? Use our JSON Formatter to pretty-print API responses, inspect error bodies and quickly find the retry information you need.
