Keycloak on Redis: Benchmarking Locke vs Embedded Infinispan

Last updated: May 2026

TL;DR

We ran a production-mode benchmark (Keycloak 26.6.1, start --optimized, 3-pod cluster on dedicated cloud nodes) comparing Keycloak’s cache layer on Redis (via Locke, our Apache 2.0 distribution) against stock Keycloak’s embedded Infinispan. Headline results:

Throughput: a tie. Redis matched Infinispan within ~0.1% up to 250 logins/sec, zero errors on both.
Resilience: the real win. When a node dies, Infinispan stalls 31-40 seconds on a JGroups rebalance (p99 31,033ms / 39,688ms for one / two nodes down). Locke keeps serving from Redis with a p99 of 908ms / 2,609ms: sub-second with one node down, under 3 seconds with two, and roughly 15-34x lower than Infinispan.
Upgrades. Across an Infinispan-version boundary, the recommended Keycloak path is a brief planned restart (Keycloak starts in well under 10s); Locke can do that same upgrade as a rolling update because there is no JGroups version handshake. Keycloak does not guarantee no-downtime minor upgrades in general, so treat this as one fewer constraint, not a blanket zero-downtime promise.
Latency: a small, honest trade. The Redis path adds roughly 27-40ms to p99 at moderate load (64ms vs 91-104ms). Both stayed under 170ms p99 up to 250 logins/sec.
Redis does not raise Keycloak’s realm-count ceiling. That limit is database-bound, not cache-bound. We say so plainly.

Why operators keep asking for Redis caching in Keycloak

Keycloak ships with embedded Infinispan for its caches and clustering, coordinated over JGroups. That works, but operators running Keycloak at scale keep hitting the same operational wall, and they have been asking for a Redis option for years. Keycloak discussion #11074 (“Using Redis as a cache store”) is one of the most-reacted infrastructure threads in the project, full of teams who already run managed Redis (ElastiCache, MemoryStore, Azure Cache) and would rather point Keycloak at it than operate a JGroups cluster.

The appeal is operational, not theoretical. JGroups discovery, cluster formation, split-brain handling, and state transfer on node changes are real things to get right. A managed Redis is something most platform teams already run and already know how to scale and monitor. The open question has always been: does moving Keycloak’s caches to Redis cost you throughput or latency? We built Locke to answer that with numbers instead of opinions.

What Locke is

Locke is an Apache 2.0 distribution of Keycloak that adds Redis as a cache backend behind a single switch (KC_CACHE=ispn|redis). When set to redis, Locke keeps Keycloak’s realm, user, and authorization caches as in-JVM L1 caches (Caffeine, bounded, short TTL) and propagates cross-node invalidation over Redis pub/sub instead of JGroups. The distributed session-class caches (authentication sessions, login failures, single-use tokens) move to a shared Redis, and user sessions delegate to the JPA store. There is no JGroups cluster.
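To make the L1-plus-invalidation design concrete, here is a minimal sketch in Python. It is not Locke's code: Locke uses Caffeine on the JVM and real Redis pub/sub, while this sketch swaps in an in-memory bus so it runs standalone. The names InvalidationBus and L1Cache, the default size and TTL, and the key format are all hypothetical.

```python
import time
from collections import OrderedDict


class InvalidationBus:
    """In-memory stand-in for a Redis pub/sub channel (hypothetical)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        # Deliver the invalidation to every subscribed node.
        for callback in self._subscribers:
            callback(key)


class L1Cache:
    """Bounded, short-TTL local cache that drops entries on bus messages."""

    def __init__(self, bus, max_entries=1000, ttl_seconds=60.0):
        self._entries = OrderedDict()  # key -> (value, expires_at)
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._bus = bus
        bus.subscribe(self._evict)

    def _evict(self, key):
        self._entries.pop(key, None)

    def get(self, key, loader):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            self._entries.move_to_end(key)  # mark as recently used
            return entry[0]
        value = loader(key)  # miss or expired: reload from the source of truth
        self._entries[key] = (value, time.monotonic() + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return value

    def invalidate(self, key):
        # Publish so that every node, including this one, drops its copy.
        self._bus.publish(key)


# Two "nodes" share one bus; a write on one invalidates the other.
bus = InvalidationBus()
node_a, node_b = L1Cache(bus), L1Cache(bus)
database = {"realm:demo": "v1"}
node_a.get("realm:demo", database.get)
node_b.get("realm:demo", database.get)
database["realm:demo"] = "v2"
node_a.invalidate("realm:demo")
print(node_b.get("realm:demo", database.get))  # reloads and prints v2
```

The point of the shape: reads are served from local memory, the size bound keeps heap usage predictable, and the only cross-node traffic for these caches is the invalidation message.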

Crucially, Locke is the stock Keycloak codebase plus this cache backend: same admin console, same SPIs, same database schema, same everything else. Flip the switch back to infinispan and you have ordinary Keycloak. That makes a clean apples-to-apples comparison possible, which is exactly what we wanted to measure.

How we tested

We wanted numbers an operator can trust, so we ran in production mode, not dev mode.

Cluster: a managed Kubernetes cluster with a dedicated node pool of 4 machines (16 vCPU / 64 GB each). Three nodes ran one Keycloak pod each (enforced anti-affinity, so it is a real 3-machine cluster); a fourth node ran the load generator, PostgreSQL, and Redis, isolated from the Keycloak pods so results reflect the cache backend rather than host contention.
Mode: both stacks built with kc.sh build and started with start --optimized (providers baked at build time). Identical resources: pinned 6 vCPU / 6 GB, -Xms2g -Xmx4g, DB pool 50, same realm, same max_connections.
Stacks: A3 = stock Keycloak 26.6.1 with KC_CACHE=ispn (JGroups jdbc-ping, confirmed 3-member view). B3 = Locke 26.6.1 with KC_CACHE=redis.
Load: the official keycloak-benchmark Gatling harness, AuthorizationCode scenario (browser login, credentials POST, code exchange, token refresh, logout), against a realm with 100 users and one confidential client. Stacks ran sequentially on the same nodes; the idle stack was scaled to zero so the numbers are clean.

All numbers below come from that run. The full report and raw evidence live in the Locke repository.

Throughput: Redis is not slower

Locke matched embedded Infinispan within ~0.1% across the entire load range, with zero errors on both stacks up to 250 logins per second. This is the result that surprises people, so here it is in full:

Load (logins/sec)	Infinispan (req/s)	Locke / Redis (req/s)	Parity
80	274.1	274.5	100.1%
160	548.1	548.1	100.0%
250	855.5	856.4	100.1%
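The parity column can be recomputed from the two throughput columns, and the same figures give a quick linearity check. A small sketch follows; the helper names are ours. The requests-per-login ratio is derived purely from the table, not from the harness's documented request mix, so treat it as a sanity check rather than a specification.

```python
# Throughput figures copied from the table above:
# load (logins/sec) -> (Infinispan req/s, Locke req/s).
results = {
    80: (274.1, 274.5),
    160: (548.1, 548.1),
    250: (855.5, 856.4),
}


def parity_percent(infinispan_rps, locke_rps):
    """Locke throughput as a percentage of Infinispan throughput."""
    return round(100.0 * locke_rps / infinispan_rps, 1)


def requests_per_login(load, rps):
    """Achieved HTTP requests per offered login at a given load."""
    return rps / load


for load, (ispn, locke) in results.items():
    print(
        f"{load} logins/sec: parity {parity_percent(ispn, locke)}%, "
        f"req/login ispn={requests_per_login(load, ispn):.2f} "
        f"locke={requests_per_login(load, locke):.2f}"
    )
```

If the requests-per-login ratio stays flat as load rises, achieved throughput is tracking offered load, which is what both stacks show here.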

The intuition behind the tie: in a real cluster, embedded Infinispan pays an N-by-N JGroups coordination and replication cost that the Redis path avoids. Redis adds a network hop on a cache miss; Infinispan adds cluster chatter. At this scale they net out, and clustered Keycloak on Redis serves the same login throughput as embedded Infinispan, error-free, all the way to saturation.

Resilience: this is the reason to move off JGroups

When a node is lost under load, embedded Infinispan stalls for tens of seconds rebalancing the cluster; Locke keeps serving from Redis with p99 under one second for a single node loss and under three seconds for two. This is the headline result, and it is not close.

We sustained 80 logins/sec for 150 seconds and killed pods at the 45-second mark.

Scenario	Locke / Redis	Stock / Infinispan
1 node down	0.01% failed, p99 908ms	0.46% failed, p99 31,033ms
2 nodes down	0.18% failed, p99 2,609ms	0.35% failed, p99 39,688ms

When an Infinispan node dies, the surviving members spend roughly 31-40 seconds on a JGroups rebalance and state transfer, and in-flight requests hang into the tens of seconds while that happens. Locke simply keeps serving from Redis: p99 stays sub-second with one node down and low-single-digit-seconds with two. For an operator on call, that is the difference between “a pod restarted and nobody noticed” and “authentication latency spiked to 31 seconds for half a minute.” If you have ever watched a Keycloak cluster do a state transfer during an incident, that number lands.
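The size of the gap is plain arithmetic on the p99 figures in the table above. A short sketch (the dictionary layout and function name are ours):

```python
# p99 latency in milliseconds, copied from the node-loss table above.
p99_ms = {
    "1 node down": {"locke": 908, "infinispan": 31_033},
    "2 nodes down": {"locke": 2_609, "infinispan": 39_688},
}


def stall_ratio(row):
    """How many times higher the Infinispan p99 is than the Locke p99."""
    return row["infinispan"] / row["locke"]


for scenario, row in p99_ms.items():
    print(f"{scenario}: {stall_ratio(row):.1f}x")
# 1 node down: 34.2x
# 2 nodes down: 15.2x
```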

Latency: a small, honest trade

At moderate load, embedded Infinispan has lower p99 latency than Locke, because Locke adds a Redis round trip for the session-class caches (authentication sessions, login failures, single-use tokens) on the login path, while its realm and user reads stay in the in-JVM L1 cache. We are not going to hide that.

Load (logins/sec)	Infinispan p99	Locke / Redis p99
80	64ms	104ms
160	64ms	91ms
250	167ms	151ms

Both stay comfortably under 170ms p99 up to 250 logins/sec, and at the top of the range Locke is actually marginally lower. The honest read: you trade some p99 latency at moderate load (27-40ms in this run) for the resilience and operational gains. For most teams running Keycloak as critical auth infrastructure, that trade is obviously worth it. If you are latency-bound at single-digit milliseconds on cache reads, measure for your own workload.
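Subtracting the two p99 columns in the table above puts a number on the trade at each load level. A short sketch (names are ours):

```python
# p99 latency in milliseconds, copied from the latency table above.
p99_ms = {
    80: {"infinispan": 64, "locke": 104},
    160: {"infinispan": 64, "locke": 91},
    250: {"infinispan": 167, "locke": 151},
}


def locke_overhead_ms(row):
    """Positive means Locke is slower at p99; negative means it is faster."""
    return row["locke"] - row["infinispan"]


for load, row in p99_ms.items():
    print(f"{load} logins/sec: {locke_overhead_ms(row):+d}ms")
# 80 logins/sec: +40ms
# 160 logins/sec: +27ms
# 250 logins/sec: -16ms
```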

Routine operations: rolling restarts and scaling are clean

Planned operations under load were a non-event on both backends. A rolling restart of all three pods completed with zero failures on both (rollout 83s on Infinispan, 79s on Locke). Scaling 3 to 2 to 3 saw the new pod join and serve within ~20-30 seconds, again with zero failures on both. Nothing surprising here, which is the point.

Cross-version upgrades: rolling vs a brief planned restart

Locke can roll a cross-version cluster upgrade under load because there is no JGroups protocol version to keep compatible across pods. That is the subtle difference, with one important caveat up front: Keycloak itself does not guarantee no-downtime minor upgrades in general, so this is one fewer constraint, not a blanket zero-downtime promise.

The honest comparison. Embedded Infinispan clusters communicate over JGroups, and a rolling upgrade that crosses an incompatible Infinispan major version (for example 15 to 16) cannot form a unified cluster view during the mixed-version window. The Keycloak docs explicitly note that mixed-minor rolling does not work across an Infinispan version boundary, so the operation the project actually recommends for this case is a brief planned restart: shut the old version down, start the new one. Since Keycloak starts in well under 10 seconds, that lands you at roughly that much downtime. We did try the mixed-version rolling path in our benchmark and it stalled for about 84 seconds; that is the cost of forcing a rolling attempt across the boundary, not a fair “no-Locke alternative” baseline.

Locke does not have that barrier. We ran two cross-version upgrades on Locke as true rolling updates under load:

Upgrade	Cutover behavior	End state
26.3.5 to 26.6.1 (full DB migration)	~85s of elevated latency, p99 peak ~2.2s, one 0.02% blip	clean on 26.6.1, p99 ~100ms, 0 failures
26.6.1 to 26.6.2 (patch)	brief p99 bump	clean on 26.6.2, 0 failures throughout

Both completed gracefully under load. The realistic operational read: where stock Keycloak would want a brief planned restart across that boundary, Locke can do the same upgrade as a rolling update. The saving is small in absolute terms (~10s of downtime), but it is a real difference for strict zero-downtime setups, subject to the caveat above that Keycloak itself does not guarantee no-downtime minor upgrades.

Adopting Locke: the same-version migration

If you are switching an existing Keycloak deployment to Locke without changing your Keycloak version, the database schema is identical, so there is no migration to run. We rolled a 3-pod cluster from stock Infinispan to Locke/Redis under sustained login load. It landed within ~1.5% of the original throughput (201.7 req/s after vs 204.8 before, both with zero failures in steady state). The cutover window was bounded at ~60-75 seconds, during which p99 briefly spiked to ~1.3s and ~0.02% of requests failed. Those failures were driven by the Infinispan pods leaving the JGroups cluster as they were replaced, not by Locke.

The practical takeaway: for a seamless switch, do the cutover in a short maintenance window or drain pods one at a time. After the cutover, you are running at parity on Redis.

Redis placement and failure behavior

Two questions every operator asks about an external cache: where do I put it, and what happens when it dies?

Placement makes a negligible difference. We compared Redis colocated on a Keycloak node against Redis on a separate node (a network hop from every pod). Throughput was identical and p99 differed by only ~6-20ms below saturation. Locke’s in-JVM L1 cache absorbs most reads, so the Redis round trip rarely gates a request. A managed, external Redis on the same network performs the same as a colocated one, so you do not need to co-locate Redis with Keycloak.

A Redis outage fails fast and recovers. Locke’s Redis client is bounded by a command timeout, so when Redis disappears, cache operations fail within the configured timeout rather than hanging request threads, and the service recovers automatically once Redis returns and the client reconnects. Because Locke’s caches genuinely need Redis, the cache path does return errors while Redis is fully down. For zero downtime through a Redis failure, run Redis in a highly available topology (Sentinel or Cluster), exactly as you would for any service that depends on Redis.
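The fail-fast behavior can be sketched without a real Redis. The following is an illustration only: FakeRedis, CacheUnavailable, and cache_get are hypothetical names, the 50ms timeout is an arbitrary value chosen for the demo, and a real client would enforce the timeout on the socket rather than with a sleep.

```python
import time


class CacheUnavailable(Exception):
    """Raised when the cache backend did not answer within the timeout."""


class FakeRedis:
    """Hypothetical stand-in for a Redis client; not a real client API."""

    def __init__(self, command_timeout_s):
        self.command_timeout_s = command_timeout_s
        self.up = True
        self._data = {}

    def _call(self):
        if not self.up:
            # A real client would block for at most its command timeout.
            time.sleep(self.command_timeout_s)
            raise TimeoutError("command timed out")

    def set(self, key, value):
        self._call()
        self._data[key] = value

    def get(self, key):
        self._call()
        return self._data.get(key)


def cache_get(client, key):
    """Return the cached value, or fail fast if the backend is down."""
    try:
        return client.get(key)
    except TimeoutError as exc:
        raise CacheUnavailable(key) from exc


client = FakeRedis(command_timeout_s=0.05)
client.set("auth-session:1", "state")

client.up = False  # simulate a Redis outage
started = time.monotonic()
try:
    cache_get(client, "auth-session:1")
except CacheUnavailable:
    print(f"failed fast after {time.monotonic() - started:.2f}s")

client.up = True  # Redis returns; the next call succeeds again
print(cache_get(client, "auth-session:1"))
```

The request thread is held for at most one timeout per cache call, and no restart is needed once the backend is reachable again.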

The honest part: Redis does not fix the realm-count ceiling

It would be easy to imply that moving caches to Redis makes Keycloak scale to tens of thousands of realms. It does not, and we are not going to pretend otherwise.

We bulk-created thousands of realms on both stacks. The well-known slow operation, GET /admin/realms (list all realms), degrades from ~0.1s to minutes past a few hundred realms, identically on Infinispan and Locke. That operation is bound by the database and response serialization, not by the cache backend, so no cache choice fixes it. This matches the broader discussion in the Keycloak community (#11074). Locke’s only realm-scaling effect is to bound the in-heap cache for predictable memory, rather than grow it unbounded. Realm-count scaling is a Keycloak and database concern; Locke’s value is operational, not a higher realm ceiling.

How this compares to Phase Two’s Redis work

Locke is not the only Redis effort in the Keycloak ecosystem, and the projects are complementary rather than competing. Phase Two (p2-inc/keycloak-redis-cache) takes a different route: a DatastoreProvider extension that moves Keycloak’s distributed caches (sessions) to Redis/Valkey while keeping local Infinispan for realm and user caches.

That difference predicts the opposite latency result. Sessions are a replicated Infinispan cache, so moving them to Redis removes replication cost and per-operation latency can improve. Locke keeps the local caches (realm, user, authorization) in-JVM, so those reads do not add a round trip; the small latency Locke adds comes instead from the per-request session-class caches (authentication sessions, login failures, single-use tokens) it writes to Redis on the login path. The two projects optimize different layers, and the strongest combined story for Keycloak-on-Redis is sessions and the local caches both off JGroups, which is also where the no-rebalance-on-node-loss resilience win comes from.

When does Locke make sense?

Based on these results, Locke is a strong fit if you:

Already operate a managed Redis and would rather not run a JGroups cluster.
Care more about graceful behavior during node loss and upgrades than about shaving tens of milliseconds off p99 login latency at moderate load.
Want to do cross-version cluster upgrades as rolling updates instead of needing a brief planned restart.
Are running Keycloak as critical infrastructure where a 31-second auth stall after an unplanned node loss is unacceptable.

Stick with embedded Infinispan if you want the lowest possible read latency, you are comfortable operating JGroups, and your node-change events are rare and well-controlled.

Frequently asked questions

Is Keycloak on Redis slower than embedded Infinispan?

No, not on throughput. In a 3-pod production benchmark, Locke (Redis) matched embedded Infinispan within ~0.1% up to 250 logins/sec with zero errors on both. p99 latency was 27-40ms higher on Redis at moderate load (80-160 logins/sec) because of the network round trip, but both stayed under 170ms p99 up to 250 logins/sec.

What happens to Keycloak when a node fails on Redis vs Infinispan?

With embedded Infinispan, losing a node triggers a JGroups rebalance and state transfer that stalled the cluster for 31-40 seconds in our test (p99 up to ~40s), hanging in-flight requests. With Locke on Redis there is no JGroups rebalance; surviving pods keep serving from Redis with a p99 of 908ms with one node down and 2,609ms with two.

Does using Redis let Keycloak handle more realms?

No. The realm-count ceiling in Keycloak (for example the slow GET /admin/realms past a few hundred realms) is bound by the database and serialization, not the cache backend. It behaves identically on Infinispan and Locke. Redis improves operational resilience, not the realm ceiling.

What happens if Redis goes down?

Locke’s Redis client is bounded by a command timeout, so cache operations fail fast instead of hanging request threads, and the service recovers automatically when Redis returns. Because the caches depend on Redis, the cache path does error while Redis is fully unavailable, so run Redis in a highly available topology (Sentinel or Cluster) for zero downtime.

Do I need to run Redis on the same node as Keycloak?

No. In our benchmark, Redis placement (colocated vs a separate node) made a negligible difference (p99 within ~6-20ms below saturation) because Locke’s in-JVM L1 cache absorbs most reads. A managed external Redis on the same network is fine.

Is Locke production-ready and open source?

Locke is Apache 2.0 licensed and is the stock Keycloak codebase plus a Redis cache backend behind the KC_CACHE switch. These benchmarks were run in production mode (start --optimized) on a real 3-pod cluster. As with any cache-dependent system, run Redis in an HA topology for production.