How do I implement rate limiting for an API gateway without adding a lot of latency?

Most teams add rate limiting to protect APIs from abuse… and then accidentally slow down every legitimate request. You don’t have to choose between safety and speed. With a fast memory layer like Redis as your rate-limiting backend, you can enforce precise limits while keeping sub-millisecond latency in the hot path.

Quick Answer: Use Redis as a shared, in‑memory counter store behind your API gateway. Keep every rate‑limit decision to a single network round trip running O(1) operations (like INCR + EXPIRE), and design your rules and data structures so one atomic call evaluates them, usually a Lua script (with richer state, such as token buckets, kept in a hash or RedisJSON document). That gives you robust rate limiting without adding noticeable latency.

The Quick Overview

What It Is: A Redis‑backed rate limiting solution that plugs into your API gateway and enforces limits via fast, atomic operations in memory—rather than slow DB calls or heavyweight external services.
Who It Is For: Platform and API teams running gateways like Kong, NGINX, Envoy, AWS API Gateway, or custom Node/Go/Java gateways that need predictable low latency, even under high concurrency.
Core Problem Solved: Traditional rate limiters either overload your primary database or add multiple network hops, which introduces latency and jitter. Redis gives you a centralized, ultra‑fast control point so rate limiting doesn’t become your new bottleneck.

How It Works

At its core, rate limiting is just counting requests per key (user, token, IP, route) in a time window and rejecting or slowing traffic that exceeds the configured threshold. The trick is doing this atomically and fast.

Redis is well‑suited here because it’s an in‑memory data structure server with single‑threaded command execution and primitives like INCR and EXPIRE. You get atomic counters and predictable performance without complex locking logic.

A typical Redis‑backed rate limiter for an API gateway looks like this:

Request hits the gateway: The gateway extracts an identifier (API key, user ID, IP) and selects which rate‑limit policy to apply (e.g., 100 requests/minute).
Gateway checks Redis: It issues an O(1) operation—often wrapped in a Lua script—to increment a counter and set or reuse a TTL to define the window.
Decision & response: If the counter exceeds the threshold, the gateway returns an HTTP 429. Otherwise, it forwards the request to the upstream service. Latency overhead is just a single Redis round trip.

Let’s break that flow down.

1. Identify the client and policy

For each request:

Derive a rate limit key based on:
- API key / OAuth client ID / user ID
- IP or subnet (for anonymous traffic)
- Endpoint or route (e.g., GET:/v1/orders)
Map that key to a policy, such as:
- 100 requests / minute per user
- 1000 requests / minute per IP
- 10 requests / second for a high‑risk endpoint

Many gateways already support “plugins” or “filters” where this mapping can live.
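Here is a minimal sketch of this mapping. It assumes policies are loaded once from gateway config into an in‑process `Map`, so no per‑request lookup leaves the process. The request shape (`apiKey`, `ip`, `method`, `route`), the `selectPolicy` name, and the `POST:/v1/login` route are hypothetical:

```javascript
// Hypothetical policy table, loaded once from gateway config (never per request).
const POLICIES = new Map([
  ['user', { limit: 100, windowSec: 60 }], // 100 requests / minute per user
  ['ip', { limit: 1000, windowSec: 60 }], // 1000 requests / minute per IP
  ['POST:/v1/login', { limit: 10, windowSec: 1 }], // 10 requests / second, high-risk route
]);

// Choose the identifier and policy for one request (request shape is assumed).
function selectPolicy(req) {
  // Authenticated callers are identified by API key, anonymous ones by IP.
  const client = req.apiKey ?? req.ip;
  const route = `${req.method}:${req.route}`;
  if (POLICIES.has(route)) {
    // A route-specific policy takes precedence over the generic ones.
    return { policy: route, client, ...POLICIES.get(route) };
  }
  const policy = req.apiKey ? 'user' : 'ip';
  return { policy, client, ...POLICIES.get(policy) };
}
```

Authenticated callers are counted by API key and anonymous ones by IP. A route‑specific entry overrides both.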

2. Check and increment in Redis (atomic)

Use Redis as a shared, centralized rate‑limit store:

Use a key pattern like:
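- rate:<policy>:<client>:<window>
  Example: rate:user:12345:2026-04-12T10:15
  (The angle brackets are placeholders. Avoid literal curly braces in real key names on Redis Cluster: text inside {…} is treated as a hash tag, so every key sharing it lands in the same slot.)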
Use INCR to bump the counter
Use EXPIRE to set the window lifetime if the key is new
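One way to produce the example key above is sketched below. It assumes UTC minute windows, and `minuteWindowKey` and `secondsLeftInMinute` are hypothetical helpers. Every request in the same UTC minute maps to the same key, so `INCR` counts those requests together, and the old key disappears once its TTL runs out:

```javascript
// Hypothetical helper: per-minute fixed-window key, e.g. "rate:user:12345:2026-04-12T10:15".
function minuteWindowKey(policy, client, nowMs = Date.now()) {
  // toISOString() is always UTC ("YYYY-MM-DDTHH:MM:SS.sssZ"); keep it up to the minute.
  const minute = new Date(nowMs).toISOString().slice(0, 16);
  return `rate:${policy}:${client}:${minute}`;
}

// Seconds until the current minute's window closes (e.g. for a Retry-After value).
function secondsLeftInMinute(nowMs = Date.now()) {
  return 60 - Math.floor((nowMs % 60000) / 1000);
}
```

The Node.js and NGINX examples later in this article use a numeric window index (minutes since the epoch) instead. That form is equally valid and works for any window length.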

From Redis’s own guidance:

Building a rate limiter with Redis is easy because of two commands INCR and EXPIRE. The basic concept is that you want to limit requests to a particular service in a given time period.

Because Redis runs a Lua script as one atomic unit (no other client's command executes in the middle of it), a small script can combine INCR and EXPIRE into a single decision with a single round trip.

Example: Simple fixed‑window rate limit with Lua

-- rate_limiter.lua
-- KEYS[1] = key (e.g. "rate:user:123:2026-04-12T10:15")
-- ARGV[1] = limit (e.g. "100")
-- ARGV[2] = ttl in seconds (e.g. "60")

local current = redis.call("INCR", KEYS[1])

if current == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[2])
end

if current > tonumber(ARGV[1]) then
  return {0, current} -- 0 = rejected
else
  return {1, current} -- 1 = allowed
end

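Gateway call (Node.js ES module using the node-redis client; assumes the script above is saved next to this module as rate_limiter.lua):

import { readFile } from 'node:fs/promises';
import { createClient } from 'redis';

// Read the Lua script shown above from disk.
const luaScriptSource = await readFile(new URL('./rate_limiter.lua', import.meta.url), 'utf8');

const redis = createClient({
  url: process.env.REDIS_URL, // Redis Cloud / Redis Software / OSS
});
redis.on('error', (err) => console.error('Redis client error', err));
await redis.connect();

// Cache the script in Redis once; each request then sends only its SHA1.
const scriptSha = await redis.sendCommand(['SCRIPT', 'LOAD', luaScriptSource]);

async function checkRateLimit(clientKey) {
  const windowTtl = 60; // seconds
  const limit = 100;
  // One counter per client per minute (window = minutes since the Unix epoch).
  const windowKey = `rate:user:${clientKey}:${Math.floor(Date.now() / 60000)}`;
  const args = ['1', windowKey, String(limit), String(windowTtl)];

  let reply;
  try {
    reply = await redis.sendCommand(['EVALSHA', scriptSha, ...args]);
  } catch (err) {
    // A restart or failover can empty the script cache; fall back to EVAL.
    if (!String(err?.message).startsWith('NOSCRIPT')) throw err;
    reply = await redis.sendCommand(['EVAL', luaScriptSource, ...args]);
  }

  const [allowed, current] = reply;
  return { allowed: allowed === 1, current };
}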

Performance impact: one Redis call per request, O(1) work, all in memory. With the gateway and Redis in the same region (ideally the same AZ), on Redis Cloud or a well‑sized Redis Software cluster, this is typically sub‑millisecond and doesn't dominate your gateway's latency budget.

3. Enforce and expose rate limit metadata

If allowed is false, return 429 Too Many Requests and useful headers:

Retry-After
X-RateLimit-Limit
X-RateLimit-Remaining
X-RateLimit-Reset

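You can derive the reset time from the key's remaining TTL (via PTTL) so clients know when they'll be unblocked. A separate PTTL call adds a second round trip. To stay at one, have the Lua script return the TTL alongside the count.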
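The following sketch turns the limiter's result into those headers. It assumes `pttlMs` is the window key's remaining TTL in milliseconds, which is what `PTTL` returns for a key with an expiry. The `X-RateLimit-*` headers are a convention rather than a standard. This version reports `X-RateLimit-Reset` as a Unix timestamp in seconds, which is one common choice. `rateLimitHeaders` is a hypothetical name:

```javascript
// Hypothetical helper: build rate-limit response headers from one check result.
// pttlMs is the window key's remaining TTL in milliseconds (as PTTL reports it).
function rateLimitHeaders({ limit, current, pttlMs, nowMs = Date.now() }) {
  const ttlMs = Math.max(0, pttlMs); // PTTL returns -1 (no TTL) or -2 (no key)
  const headers = {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, limit - current)),
    // Unix time (seconds) at which the current window resets.
    'X-RateLimit-Reset': String(Math.ceil((nowMs + ttlMs) / 1000)),
  };
  if (current > limit) {
    // Whole seconds until the window resets; never advertise 0.
    headers['Retry-After'] = String(Math.max(1, Math.ceil(ttlMs / 1000)));
  }
  return headers;
}
```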

Features & Benefits Breakdown

| Core Feature | What It Does | Primary Benefit |
| --- | --- | --- |
| In‑memory counters with `INCR` | Tracks per‑client request counts in Redis using atomic increments. | Minimal latency and no lock contention, even at high QPS. |
| TTL‑based windows with `EXPIRE` | Uses key expiry to define rate‑limit windows (fixed or sliding). | Automatic cleanup, no background jobs or cron sweeps. |
| Gateway‑friendly integration | Works with Kong, NGINX, Envoy, or custom gateways via a single Redis call. | Simple to wire in without re‑architecting your stack. |
| Clustering & sharding | Store keys across Redis Cluster nodes. | Scales horizontally with your API traffic. |
| Metrics & observability | Integrate with Prometheus/Grafana using Redis v2 metrics and histograms. | See p95/p99 latency impact and adjust before users feel it. |
| Deploy anywhere | Use Redis Cloud, Redis Software, or Redis Open Source on your infra. | Match your rate limiter to your existing cloud/on‑prem strategy. |

Ideal Use Cases

Best for internet‑facing APIs behind a gateway: Because every request can cheaply hit Redis for enforcement, you can defend against application‑layer DDoS, brute force, and abuse without saturating your primary database or adding multiple hops. Volumetric, network‑level attacks still need protection upstream of the gateway.
Best for multi‑tenant SaaS with noisy neighbors: Because you can define per‑tenant and per‑endpoint policies and enforce them in memory, you can keep a single cluster safe from “noisy” tenants while preserving low latency for everyone else.

Designing for low latency from day one

Rate limiting is only “low latency” if you design it that way. Here are patterns I recommend from running high‑traffic gateways on Redis.

Keep the hot path to a single Redis round trip

Avoid:

Multiple sequential Redis calls per request (e.g., separate GET, INCR, EXPIRE, TTL).
Extra network hops (e.g., gateway → microservice → rate limiting service → Redis).

Aim for:

Gateway → Redis → decision
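One EVAL/EVALSHA is ideal. With plain commands, send INCR and EXPIRE key ttl NX together in one pipeline; the NX option (Redis 7.0+) sets a TTL only when the key has none.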
Policy lookups in memory at the gateway (e.g., config file, in‑process map) rather than per‑request DB reads.

Use the simplest algorithm that fits

Complex algorithms can cost CPU and latency. A sensible progression:

Fixed window (simplest)
- Example: 100 requests/minute.
- Implementation: clock‑based key (:2026-04-12T10:15) + INCR + EXPIRE.
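- Tradeoff: “Burstiness” at window edges. A client can spend the full limit at the end of one window and again at the start of the next, for up to 2× the limit in a few seconds (good enough for many cases).
Sliding window, log or counter variant (more precise)
- You can implement a sliding window with:
  - A sorted set of request timestamps per client (sliding log), or
  - Two adjacent fixed windows, with the previous window's count weighted by how much of it still overlaps the sliding interval (sliding counter).
- This costs more Redis operations per request. For most API gateways, I use the two‑window technique (current + previous minute, weighted) to keep it cheap.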
Token bucket / leaky bucket (smoothing)
- Good when you care about evenly distributed traffic.
- Implemented with a Lua script that:
  - Calculates “refilled” tokens based on last timestamp.
  - Stores tokens + last_update in a hash or JSON.
- Slightly more CPU in Redis, still O(1) and a single call.
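The two‑window (sliding counter) technique can be sketched as a pure function. It assumes the per‑window counts are read from two fixed‑window keys, and it computes the weighted approximation, not an exact sliding log. The `simulate` helper is hypothetical and runs in memory only. It replays one burst against both algorithms to make the window‑edge tradeoff concrete:

```javascript
// Sliding-window-counter estimate built from two adjacent fixed windows.
// In the gateway, prevCount/currCount would come from the previous and current
// window keys; here they are plain numbers.
function slidingEstimate(prevCount, currCount, windowMs, nowMs) {
  const elapsed = nowMs % windowMs; // time already spent in the current window
  const prevWeight = (windowMs - elapsed) / windowMs; // share of previous window still in view
  return prevCount * prevWeight + currCount;
}

// Hypothetical in-memory replay: how many requests each algorithm admits.
function simulate(timestampsMs, limit, windowMs) {
  const fixed = new Map(); // window index -> requests seen (fixed window, like INCR)
  const sliding = new Map(); // window index -> requests admitted (sliding counter)
  let fixedAllowed = 0;
  let slidingAllowed = 0;
  for (const t of timestampsMs) {
    const w = Math.floor(t / windowMs);
    const seen = (fixed.get(w) ?? 0) + 1;
    fixed.set(w, seen);
    if (seen <= limit) fixedAllowed++;
    const est = slidingEstimate(sliding.get(w - 1) ?? 0, sliding.get(w) ?? 0, windowMs, t);
    if (est + 1 <= limit) {
      sliding.set(w, (sliding.get(w) ?? 0) + 1);
      slidingAllowed++;
    }
  }
  return { fixedAllowed, slidingAllowed };
}
```

With a limit of 100 per minute, suppose a client sends 100 requests at 59.5 s and another 100 at 60.5 s. The fixed window admits all 200 within one second. The weighted estimate rejects the second batch because almost all of the previous minute still counts.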

Under heavy load, the difference in latency is mostly about additional commands per request, not about Redis itself. Start with fixed window and only move up if your API semantics demand it.

Co‑locate Redis with your gateway where possible

Network distance shows up directly in p99 latency.

If you’re on Kubernetes:
- Run Redis Software in the same region and preferably in the same AZ as your gateway pods.
- Use an internal service (ClusterIP) and local network.
If you’re on public cloud:
- Prefer Redis Cloud or a managed Redis close to your gateway (same region/AZ).
For multi‑region:
- Use regional Redis clusters per gateway region.
- Apply rate limits region‑local for performance.
- Only centralize if you truly need global limits: every check then either pays cross‑region latency or relies on approximate counts.

Implementation examples with Redis

Here are a couple of concrete patterns that stay latency‑friendly.

Example 1: Per‑IP fixed window in NGINX + Lua + Redis

Using OpenResty or NGINX with Lua:

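-- access_by_lua_block: per-IP, per-minute fixed window
local redis = require "resty.redis"
local r = redis:new()
r:set_timeout(1) -- 1 ms timeout: deliberately tight; raise it if timeouts show up in the error log

local ok, err = r:connect("redis.default.svc.cluster.local", 6379)
if not ok then
  ngx.log(ngx.ERR, "failed to connect to Redis: ", err)
  -- Option: fail open or closed depending on your policy
  return -- fail open: the request continues
end

local client_ip = ngx.var.remote_addr
local now = ngx.time()
local window = math.floor(now / 60) -- per minute
local key = "rate:ip:" .. client_ip .. ":" .. window

local limit = 100
local ttl = 60

local current, err = r:incr(key)
if not current then
  ngx.log(ngx.ERR, "failed to incr: ", err)
  return
end

if current == 1 then
  r:expire(key, ttl)
end

-- Return the connection to the per-worker pool (10 s idle timeout, up to 100
-- connections) so later requests skip the TCP handshake.
r:set_keepalive(10000, 100)

if current > limit then
  ngx.status = 429
  -- Seconds until this minute's window ends, not the full TTL.
  ngx.header["Retry-After"] = ttl - (now % 60)
  ngx.say("Too Many Requests")
  return ngx.exit(ngx.HTTP_OK) -- finish the request; the 429 status was already sent
end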

This stays simple and avoids Lua scripting on the Redis side. It sends at most two commands (INCR, plus EXPIRE only for the first request in each window), so most requests cost a single round trip.

Example 2: Token bucket rate limiter in Redis Cloud (Node.js)

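For APIs that need burst tolerance but a steady average, a token bucket works well. This version keeps bucket state in a JSON document (RedisJSON), so the target database must support JSON.GET and JSON.SET: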

// Using Redis as a token bucket:
// key -> JSON { tokens: number, last_refill_ms: number }

import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const TOKEN_BUCKET_LUA = `
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- tokens/sec
local now_ms = tonumber(ARGV[3])

local bucket = redis.call("JSON.GET", key)
local tokens
local last_refill_ms

if bucket then
  local obj = cjson.decode(bucket)
  tokens = obj["tokens"]
  last_refill_ms = obj["last_refill_ms"]
else
  tokens = capacity
  last_refill_ms = now_ms
end

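-- now_ms comes from the gateway's clock; ignore time going backwards so skew cannot drain tokens
local elapsed_sec = math.max(0, now_ms - last_refill_ms) / 1000
local refill = elapsed_sec * refill_rate
tokens = math.min(capacity, tokens + refill)
last_refill_ms = math.max(last_refill_ms, now_ms)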

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

local new_bucket = cjson.encode({ tokens = tokens, last_refill_ms = last_refill_ms })
redis.call("JSON.SET", key, "$", new_bucket)
redis.call("EXPIRE", key, 600) -- idle buckets expire in 10 minutes

return allowed
`;

const sha = await redis.sendCommand(['SCRIPT', 'LOAD', TOKEN_BUCKET_LUA]);

async function allowRequest(clientId) {
  const key = `bucket:${clientId}`;
  const capacity = 100;
  const refillPerSecond = 10;
  const nowMs = Date.now();

  const allowed = await redis.sendCommand([
    'EVALSHA',
    sha,
    '1',
    key,
    String(capacity),
    String(refillPerSecond),
    String(nowMs),
  ]);

  return allowed === 1;
}

Using JSON here isn’t required, but it illustrates how Redis’s modern data structures (RedisJSON) can store richer state while still keeping everything in memory and in a single script.
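To see what a capacity/refill pair permits before deploying it, you can mirror the script's arithmetic in plain JavaScript. `makeBucket` and `take` are hypothetical offline helpers, not part of the gateway path. `take` also ignores negative elapsed time, so a clock that steps backwards cannot drain tokens:

```javascript
// Offline mirror of the token bucket arithmetic (no Redis involved).
function makeBucket(capacity, refillPerSecond, nowMs) {
  return { capacity, refillPerSecond, tokens: capacity, lastRefillMs: nowMs };
}

// Refill based on elapsed time, then try to spend one token.
function take(bucket, nowMs) {
  const elapsedSec = Math.max(0, nowMs - bucket.lastRefillMs) / 1000;
  bucket.tokens = Math.min(bucket.capacity, bucket.tokens + elapsedSec * bucket.refillPerSecond);
  bucket.lastRefillMs = Math.max(bucket.lastRefillMs, nowMs);
  if (bucket.tokens >= 1) {
    bucket.tokens -= 1;
    return true; // allowed
  }
  return false; // rejected
}
```

With capacity 100 and 10 tokens/second, a burst of 150 simultaneous requests admits 100. One second later, 10 more get through.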

Limitations & Considerations

Redis availability becomes a dependency:
If your gateway can’t reach Redis, you can’t enforce limits. Design a clear failure mode:
- Fail open (allow all) to preserve availability but risk abuse, or
- Fail closed (block) to protect systems but risk false positives.
Use Redis Cloud or a highly available Redis Software cluster with:
- Automatic failover
- Replicas
- Proper resource sizing
Global limits across regions add complexity:
If you try to enforce a strict global limit across multiple geographically distant gateways with a single Redis cluster, network latency will rise. Prefer:
- Regional limits with regional Redis,
- Or looser eventual‑consistency mechanisms if you truly need global enforcement.
Hot keys under skewed traffic:
If all traffic is from one client/IP, you may create a hot key. Redis handles this well at moderate QPS, but for extreme skew:
- Consider a small amount of key hashing or sharding.
- Monitor command latency histograms (p99/p99.9) using Prometheus/Grafana.
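The sketch below makes the failure mode explicit at the gateway. It assumes the check is any async function that resolves to `true`/`false`, for instance one wrapping `checkRateLimit` above. `checkWithFailMode`, its 5 ms default timeout, and the `degraded` flag are hypothetical choices:

```javascript
// Hypothetical wrapper: bound the rate-limit check with a timeout and apply an
// explicit failure policy when Redis is slow or unreachable.
async function checkWithFailMode(check, { timeoutMs = 5, failOpen = true } = {}) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('rate limit check timed out')), timeoutMs);
  });
  try {
    const allowed = await Promise.race([check(), timeout]);
    return { allowed, degraded: false };
  } catch (err) {
    // Redis error or timeout: fail open (allow) or closed (reject) per policy.
    return { allowed: failOpen, degraded: true, error: err.message };
  } finally {
    clearTimeout(timer);
  }
}
```

Returning `degraded: true` lets the gateway count how often it fell back, which is worth alerting on whichever mode you choose.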

Pricing & Plans

You can implement this pattern with:

Redis Open Source (Download Redis 8):
- Best for small teams or early‑stage projects that are comfortable running Redis themselves on VMs or Kubernetes.
- You manage clustering, failover, and observability.
Redis Software (self‑managed on‑prem / hybrid):
- Best for organizations that need Redis close to existing data centers or have strict compliance requirements.
- You get enterprise features like Active-Active Geo Distribution, automatic failover, and richer observability, wired into your own infrastructure.
Redis Cloud (fully managed):
- Best for cloud‑native teams needing “set it and forget it” rate limiting at scale.
- Redis handles provisioning, scaling, resilience, and metrics, so your gateway just connects and starts issuing INCR/Lua calls.

For detailed plan info and sizing guidance, you can talk with Redis directly—most teams size their rate‑limit clusters based on peak QPS × ops per request, and Redis Cloud can model that with you.

Redis Cloud: Best for teams on AWS/Azure/GCP who want managed Redis with SLA, automatic scaling, and multi‑AZ resilience for production gateways.
Redis Software / Open Source: Best for teams with strong ops muscle who want to embed Redis next to existing infrastructure, including air‑gapped or on‑prem environments.

Frequently Asked Questions

How much latency does Redis‑based rate limiting actually add?

Short Answer: Typically well under 1 ms per request for the Redis part, if you keep it to a single call in the same region.

Details:
Redis is an in‑memory engine; INCR and EXPIRE are O(1) and extremely fast. The latency impact is dominated by network round‑trip time between the gateway and Redis. In most production setups I’ve run:

Gateway and Redis in the same AZ: ~0.2–0.7 ms p95 for the rate‑limit check.
Gateway and Redis in different regions: can jump to 10–50 ms or more—this is what you must avoid.

Instrument the gateway with:

Redis_latency_histogram (Prometheus v2 metrics)
p95/p99 dashboards in Grafana

so you can see whether rate limiting is impacting user‑visible latency and adjust placement or sizing.
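Server‑side Redis latency metrics generally measure time spent inside Redis, not the network hop from the gateway. For that reason, it helps to time the check at the gateway too. The following is a minimal sketch that assumes no particular metrics library; in production you would record into your Prometheus client's histogram instead. The bucket bounds and helper names are hypothetical:

```javascript
// Minimal gateway-side latency histogram (stand-in for a real metrics client).
// Bucket upper bounds are in milliseconds; the last slot counts everything above.
const BUCKETS_MS = [0.25, 0.5, 1, 2, 5, 10, 50];

function createHistogram(bounds = BUCKETS_MS) {
  return { bounds, counts: new Array(bounds.length + 1).fill(0) };
}

function observe(hist, ms) {
  const i = hist.bounds.findIndex((b) => ms <= b);
  hist.counts[i === -1 ? hist.bounds.length : i]++;
}

// Time any async call (e.g. the rate-limit check), including network time.
async function timed(hist, fn) {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    observe(hist, performance.now() - start);
  }
}

// Approximate a percentile as the upper bound of the bucket that contains it.
function percentileUpperBound(hist, p) {
  const total = hist.counts.reduce((a, b) => a + b, 0);
  if (total === 0) return NaN;
  let seen = 0;
  for (let i = 0; i < hist.counts.length; i++) {
    seen += hist.counts[i];
    if (seen >= p * total) return i < hist.bounds.length ? hist.bounds[i] : Infinity;
  }
  return Infinity;
}
```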

Can I use my primary database (like Postgres) instead of Redis for rate limiting?

Short Answer: You can, but you probably shouldn’t—it will become a bottleneck and can hurt both latency and stability.

Details:
Traditional relational databases are optimized for durability and complex queries, not for millions of tiny, write‑heavy counters per second. If you use Postgres/MySQL for rate limiting:

You’ll create massive write amplification and contention on hot rows.
You’ll need transactions and row locks to get atomicity, adding more latency.
Under attack or high demand, your database gets hammered, and user‑facing queries slow down or fail.

Redis is designed as a fast memory layer, making it ideal for ephemeral, high‑volume counters like rate limiters. You can still keep long‑term “usage” aggregates in your primary DB (e.g., for billing) by periodically exporting from Redis, but keep the real‑time decision logic in Redis.

Summary

If you design rate limiting as “just another microservice” with its own DB, you’ll feel the pain in tail latency and fragility. By moving rate limiting into a Redis‑backed fast memory layer, you can:

Enforce per‑user, per‑IP, and per‑endpoint limits with minimal overhead.
Keep all decisions to a single Redis round trip using primitives like INCR and EXPIRE, or small Lua scripts.
Scale horizontally with Redis Cluster, monitor real‑time performance with Prometheus/Grafana, and keep your upstream services protected.

The key is to:

Place Redis close to your gateway,
Choose a simple algorithm that fits your needs,
And design for failure up front (HA, clear fail‑open/closed behavior).

You get robust rate limiting without turning it into your new latency bottleneck.
