Top 7 Keycloak Cluster Configuration Best Practices
Last updated: March 2026
Running Keycloak in a single-node setup works fine for development, but production workloads demand a clustered deployment. A misconfigured cluster leads to dropped sessions, inconsistent login states, and degraded performance under load. Getting the configuration right from the start saves you from painful debugging at 2 AM when your authentication layer stops responding.
This guide covers seven battle-tested best practices for configuring Keycloak clusters. Each practice includes concrete configuration examples you can adapt to your own infrastructure.
1. Choose the Right Discovery Mechanism
Before Keycloak nodes can form a cluster, they need to find each other. Keycloak uses JGroups for cluster communication, and the discovery mechanism you choose depends entirely on your deployment environment.
Why It Matters
If nodes cannot discover each other, you end up with multiple independent Keycloak instances instead of a coordinated cluster. Users will get redirected to different nodes mid-flow, authentication sessions will be lost, and your login pages will throw cryptic errors.
Kubernetes: Use DNS_PING
For Kubernetes deployments, DNS_PING is the recommended discovery method. It uses DNS queries against the headless service to find other pods in the cluster.
# Keycloak environment variables for Kubernetes
KC_CACHE: ispn
KC_CACHE_STACK: kubernetes
JAVA_OPTS_APPEND: >-
  -Djgroups.dns.query=keycloak-headless.keycloak-namespace.svc.cluster.local
You also need a headless service in your Kubernetes manifest:
apiVersion: v1
kind: Service
metadata:
  name: keycloak-headless
  namespace: keycloak-namespace
spec:
  clusterIP: None
  selector:
    app: keycloak
  ports:
    - port: 7800
      name: jgroups
Bare Metal or VMs: Use JDBC_PING
For traditional server deployments without Kubernetes DNS, JDBC_PING uses the existing database as a shared registry for node discovery. Each node registers itself in a database table, and other nodes query that table to find peers.
<!-- Custom Infinispan config: cache-ispn.xml -->
<jgroups>
    <stack name="jdbc-ping-tcp" extends="tcp">
        <JDBC_PING connection_driver="org.postgresql.Driver"
                   connection_url="jdbc:postgresql://db-host:5432/keycloak"
                   connection_username="${env.KC_DB_USERNAME}"
                   connection_password="${env.KC_DB_PASSWORD}"
                   initialize_sql="CREATE TABLE IF NOT EXISTS JGROUPSPING (
                       own_addr varchar(200) NOT NULL,
                       cluster_name varchar(200) NOT NULL,
                       ping_data BYTEA,
                       constraint PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name)
                   )"
                   info_writer_sleep_time="500"
                   remove_all_data_on_view_change="true"
                   stack.combine="REPLACE"
                   stack.position="MPING"/>
    </stack>
</jgroups>
Then start Keycloak with:
bin/kc.sh start --cache-config-file=cache-ispn.xml
If you are deploying on Kubernetes and want to skip the manual configuration, Skycloak’s managed clusters handle discovery configuration automatically, so you can focus on your application logic instead of infrastructure plumbing.
2. Configure Infinispan Caching Properly
Keycloak uses Infinispan as its embedded caching layer. The cache configuration directly affects how your cluster handles sessions, authentication flows, and realm metadata.
Why It Matters
Incorrect cache settings cause two categories of problems. Under-provisioned caches lead to excessive database queries and slow login times. Misconfigured distributed caches cause session data to be lost when nodes leave the cluster.
Key Caches to Configure
Keycloak uses several distinct caches, each with different requirements:
<distributed-cache name="sessions" owners="2">
    <expiration lifespan="-1"/>
</distributed-cache>
<distributed-cache name="authenticationSessions" owners="2">
    <expiration lifespan="-1"/>
</distributed-cache>
<distributed-cache name="offlineSessions" owners="2">
    <expiration lifespan="-1"/>
</distributed-cache>
<local-cache name="realms">
    <encoding>
        <key media-type="application/x-java-object"/>
        <value media-type="application/x-java-object"/>
    </encoding>
    <memory max-count="10000"/>
</local-cache>
<local-cache name="users">
    <encoding>
        <key media-type="application/x-java-object"/>
        <value media-type="application/x-java-object"/>
    </encoding>
    <memory max-count="10000"/>
    <expiration max-idle="60000"/>
</local-cache>
Understanding the owners Parameter
The owners parameter controls how many copies of each cache entry exist across the cluster. Setting owners="2" means each session is stored on two nodes. If one node goes down, the session data survives on the other.
Setting owners too high increases memory usage and network traffic. Setting it to 1 means any node failure loses the sessions it held. For most production clusters, owners="2" strikes the right balance.
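The memory and availability tradeoff above is easy to check with back-of-the-envelope numbers. A minimal sketch, where the session size, session count, and node count are illustrative assumptions rather than Keycloak defaults:

```python
# Rough sizing sketch for the Infinispan "owners" setting.
# All inputs (session size, counts) are illustrative assumptions.

def cluster_session_memory_mb(sessions: int, avg_session_bytes: int, owners: int) -> float:
    """Total memory across the cluster: every entry is stored `owners` times."""
    return sessions * avg_session_bytes * owners / 1024 / 1024

def sessions_lost_on_node_failure(sessions: int, nodes: int, owners: int) -> int:
    """Sessions lost when a single node fails. With owners >= 2 a backup
    copy survives elsewhere; with owners == 1, each node holds roughly
    sessions / nodes entries with no backup."""
    if owners >= 2:
        return 0
    return sessions // nodes

# 100k sessions of ~4 KB each on a 3-node cluster:
print(cluster_session_memory_mb(100_000, 4096, owners=2))   # 781.25 (MB cluster-wide)
print(sessions_lost_on_node_failure(100_000, 3, owners=1))  # 33333
print(sessions_lost_on_node_failure(100_000, 3, owners=2))  # 0
```

Doubling owners doubles the cluster-wide session footprint, which is why owners="2" rather than something higher is usually the sweet spot.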
Local vs. Distributed Caches
Realm and user caches are local caches because they are read-heavy and rarely change. Distributing them would add unnecessary network overhead. Session caches must be distributed because users may hit different nodes during their authentication flow.
3. Set Up Proper Database Connection Pooling
Every Keycloak node in your cluster maintains its own database connection pool. Misconfigured pooling is one of the most common causes of cluster instability under load.
Why It Matters
When a traffic spike hits, Keycloak needs database connections immediately. If the pool is exhausted, authentication requests queue up, timeouts cascade, and your users see errors. On the other hand, setting the pool too large wastes database resources and can hit connection limits on your database server.
Recommended Configuration
# Connection pool sizing
KC_DB_POOL_INITIAL_SIZE=10
KC_DB_POOL_MIN_SIZE=10
KC_DB_POOL_MAX_SIZE=50
# Connection validation (PostgreSQL JDBC URL properties; Keycloak appends
# this string verbatim to the JDBC URL, so use ?key=value&key=value syntax)
KC_DB_URL_PROPERTIES="?connectTimeout=30&socketTimeout=30"
# Transaction timeouts
KC_TRANSACTION_XA_ENABLED=false
Sizing the Pool
A good starting point is to calculate based on your expected concurrent users:
- Small clusters (under 1,000 concurrent users): max-size=20 per node
- Medium clusters (1,000 to 10,000 concurrent users): max-size=50 per node
- Large clusters (over 10,000 concurrent users): max-size=100 per node
Remember that the total connections across all nodes must not exceed your database server’s maximum connection limit. If you run 4 nodes with max-size=50, your database needs to support at least 200 connections plus overhead for maintenance queries.
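The arithmetic above is worth encoding as a pre-deployment check. A minimal sketch, where the maintenance overhead of 20 connections is an assumption to tune for your own setup:

```python
def required_db_connections(nodes: int, pool_max_size: int,
                            maintenance_overhead: int = 20) -> int:
    """Minimum max_connections the database must allow: every node can
    grow its pool to pool_max_size, plus headroom for admin and
    maintenance sessions (the overhead default is an assumption)."""
    return nodes * pool_max_size + maintenance_overhead

# The 4-node example from the text: 4 * 50 = 200, plus overhead.
print(required_db_connections(4, 50))  # 220
```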
Connection Validation
Always enable connection validation. Database connections go stale due to network interruptions, database restarts, or firewall timeouts. Without validation, Keycloak will attempt to use dead connections and throw errors. The XML below uses the legacy WildFly datasource format; on the current Quarkus distribution, the built-in Agroal pool is configured through the KC_DB_* options shown above instead:
<datasource jndi-name="java:jboss/datasources/KeycloakDS"
            pool-name="KeycloakDS">
    <connection-url>jdbc:postgresql://db-host:5432/keycloak</connection-url>
    <pool>
        <min-pool-size>10</min-pool-size>
        <max-pool-size>50</max-pool-size>
    </pool>
    <validation>
        <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.postgres.PostgreSQLValidConnectionChecker"/>
        <validate-on-match>false</validate-on-match>
        <background-validation>true</background-validation>
        <background-validation-millis>10000</background-validation-millis>
    </validation>
</datasource>
4. Implement Sticky Sessions with Your Load Balancer
Sticky sessions (session affinity) route a user’s requests to the same Keycloak node throughout their authentication flow. While Keycloak’s distributed cache makes sticky sessions technically optional, they remain a strong recommendation for production clusters.
Why It Matters
Without sticky sessions, every request potentially hits a different node. That node must then fetch the session data from the distributed cache over the network. This adds latency to every request. With sticky sessions, the node that owns the session handles the request directly, reading from local memory instead of the network.
Configuring Nginx
upstream keycloak_backend {
    ip_hash;  # Basic sticky sessions based on client IP
    server keycloak-node-1:8443;
    server keycloak-node-2:8443;
    server keycloak-node-3:8443;
}

# For more reliable session affinity, use cookie-based stickiness
# (note: the "sticky cookie" directive requires NGINX Plus)
upstream keycloak_backend {
    server keycloak-node-1:8443;
    server keycloak-node-2:8443;
    server keycloak-node-3:8443;
    sticky cookie KC_ROUTE expires=1h domain=.example.com httponly secure;
}
Configuring HAProxy
backend keycloak_backend
    balance roundrobin
    cookie KC_ROUTE insert indirect nocache httponly secure
    server kc1 keycloak-node-1:8443 check cookie kc1 ssl verify none
    server kc2 keycloak-node-2:8443 check cookie kc2 ssl verify none
    server kc3 keycloak-node-3:8443 check cookie kc3 ssl verify none
Kubernetes Ingress
If you are running on Kubernetes with an Ingress controller:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keycloak-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "KC_ROUTE"
    nginx.ingress.kubernetes.io/session-cookie-expires: "3600"
    nginx.ingress.kubernetes.io/session-cookie-secure: "true"
    nginx.ingress.kubernetes.io/session-cookie-samesite: "Lax"
spec:
  rules:
    - host: auth.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keycloak
                port:
                  number: 8443
You can prototype different load balancer configurations locally using our Keycloak Docker Compose Generator before deploying to your production environment.
5. Configure Proper Health Checks
Health checks tell your load balancer and orchestrator whether a Keycloak node is ready to serve traffic. Getting these wrong means sending requests to nodes that are still starting up or are in a degraded state.
Why It Matters
Keycloak takes time to start. It needs to connect to the database, join the cluster, load realm configurations, and warm up caches. If your load balancer starts routing traffic before these steps complete, users will see errors. Conversely, if your liveness probe is too aggressive, Kubernetes may kill nodes that are temporarily slow but otherwise healthy.
Keycloak’s Built-in Health Endpoints
Enable the health endpoint when starting Keycloak:
bin/kc.sh start --health-enabled=true
This exposes:
- /health/ready – Returns 200 when the node is ready to accept traffic
- /health/live – Returns 200 when the node process is running
- /health/started – Returns 200 when the node has completed startup
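When wiring these endpoints into your own tooling, the decision logic is simple: trust the HTTP status code, and optionally cross-check the JSON body. A minimal sketch, assuming the MicroProfile Health payload shape ({"status": "UP", "checks": [...]}) that Keycloak's health endpoints emit:

```python
import json

def is_ready(status_code: int, body: str) -> bool:
    """Interpret a /health/ready response: ready only when the HTTP
    status is 200 AND the health payload reports status UP."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
        return payload.get("status") == "UP"
    except (ValueError, AttributeError):
        # Malformed or non-object body: treat as not ready.
        return False

print(is_ready(200, '{"status": "UP", "checks": []}'))   # True
print(is_ready(503, '{"status": "DOWN", "checks": []}')) # False
print(is_ready(200, 'not json'))                         # False
```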
Kubernetes Probe Configuration
containers:
  - name: keycloak
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 9000
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /health/live
        port: 9000
      initialDelaySeconds: 60
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 5
    startupProbe:
      httpGet:
        path: /health/started
        port: 9000
      initialDelaySeconds: 15
      periodSeconds: 5
      failureThreshold: 30
Key Configuration Choices
The startupProbe is critical. It gives Keycloak up to 165 seconds (15 + 30 * 5) to complete its initial startup before Kubernetes considers it failed. Without a startup probe, the liveness probe may kill the container before it finishes starting.
Set the liveness probe’s failureThreshold higher than the readiness probe. You want to remove a slow node from the load balancer quickly (readiness) but give it more time to recover before killing it (liveness).
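The probe-window arithmetic generalizes to any probe: initial delay plus one period per tolerated failure. A minimal sketch for sanity-checking your own thresholds:

```python
def max_probe_window(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Worst-case seconds a probe allows before the container is
    considered failed: the initial delay plus one period per
    tolerated failure."""
    return initial_delay + period * failure_threshold

# Startup probe from the manifest above: 15 + 5 * 30 = 165 s to start
print(max_probe_window(15, 5, 30))   # 165
# Liveness probe: 60 + 30 * 5 = 210 s of failures before a kill
print(max_probe_window(60, 30, 5))   # 210
# Readiness probe: 30 + 10 * 3 = 60 s before removal from the LB
print(max_probe_window(30, 10, 3))   # 60
```

Note how the readiness window (60 s) is much shorter than the liveness window (210 s), matching the advice above: pull a slow node out of rotation quickly, but kill it reluctantly.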
For a deeper look at health monitoring and real-time cluster metrics, check out how Skycloak’s Insights dashboard surfaces these signals without requiring you to build custom monitoring infrastructure.
6. Plan Your Session Failover Strategy
When a node leaves the cluster, whether due to a crash, a rolling update, or scaling down, what happens to the sessions it owned? Your failover strategy determines whether users experience a seamless transition or get logged out.
Why It Matters
In a cluster with owners="2", each session exists on two nodes. If one node goes down, the backup node still has the session data. But if both owners go down simultaneously, or if you configured owners="1", those sessions are lost.
Distributed Cache with Backup Owners
This is the default and recommended approach. Configure your session caches with at least two owners:
<distributed-cache name="sessions" owners="2">
    <expiration lifespan="-1"/>
    <state-transfer timeout="300000"/>
</distributed-cache>
<distributed-cache name="clientSessions" owners="2">
    <expiration lifespan="-1"/>
    <state-transfer timeout="300000"/>
</distributed-cache>
The state-transfer timeout controls how long a joining node waits to receive its share of the distributed cache data. Set this high enough for large session stores to transfer completely.
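To pick a sensible timeout, a back-of-the-envelope estimate of how long the transfer actually takes is useful. A rough sketch where every input (session count, entry size, link speed) is an assumption to replace with your own numbers:

```python
def state_transfer_seconds(entries: int, avg_entry_bytes: int, owners: int,
                           nodes: int, link_mbps: float) -> float:
    """Back-of-the-envelope time for a joining node to receive its share
    of a distributed cache: total replicated data split evenly across
    nodes, pushed over one network link. Ignores protocol overhead."""
    share_bytes = entries * avg_entry_bytes * owners / nodes
    link_bytes_per_sec = link_mbps * 1_000_000 / 8
    return share_bytes / link_bytes_per_sec

# 500k sessions of ~4 KB, owners=2, joining a 4-node cluster over 1 Gbps:
print(round(state_transfer_seconds(500_000, 4096, 2, 4, 1000), 1))  # 8.2
```

Even a pessimistic estimate like this sits far below the 300-second timeout in the snippet above, which is the point: size the timeout with a large safety margin so a slow transfer never aborts a node join.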
Graceful Shutdown for Rolling Updates
When performing rolling updates, let Keycloak drain its sessions before stopping. In Kubernetes:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 120
Setting maxUnavailable: 0 ensures a new node is fully ready before an old node is terminated. The terminationGracePeriodSeconds gives the departing node time to transfer its cache state to remaining nodes.
External Infinispan for Large Deployments
For clusters handling over 50,000 concurrent sessions, consider offloading session storage to an external Infinispan cluster:
<remote-store xmlns="urn:infinispan:config:store:remote:14.0"
              cache="sessions"
              purge="false"
              preload="false"
              shared="true"
              raw-values="true">
    <remote-server host="infinispan-server" port="11222"/>
    <security>
        <authentication>
            <digest username="keycloak" password="${env.ISPN_PASSWORD}" realm="default"/>
        </authentication>
    </security>
</remote-store>
This separates session persistence from Keycloak’s lifecycle, so you can restart or scale the Keycloak cluster without losing any sessions.
7. Monitor Cluster Health
A running cluster is not necessarily a healthy cluster. Without monitoring, you will not know about degraded nodes, cache imbalances, or growing response times until users start complaining.
Why It Matters
Cluster issues often develop gradually. A single slow database query, a network partition between two nodes, or a memory leak in one instance can cascade into a full outage if left undetected. Monitoring gives you early warning before small issues become big problems.
Enable Prometheus Metrics
bin/kc.sh start --metrics-enabled=true
This exposes a Prometheus-compatible endpoint at /metrics on the management port (default 9000). Add every node to your scrape configuration:
# Scrape config for Prometheus
- job_name: 'keycloak'
  metrics_path: /metrics
  static_configs:
    - targets:
        - keycloak-node-1:9000
        - keycloak-node-2:9000
        - keycloak-node-3:9000
Critical Metrics to Alert On
| Metric | Description | Alert Threshold |
|---|---|---|
| keycloak_request_duration_seconds | Login flow response time | p95 > 2 seconds |
| vendor_cache_manager_default_cluster_size | Number of nodes in cluster | Less than expected count |
| vendor_cache_manager_default_cache_sessions_statistics_entries | Active session count | Sudden drop > 20% |
| jvm_memory_used_bytes | JVM heap usage | Over 85% of max |
| db_pool_active_count | Active DB connections | Over 80% of max pool |
Infinispan Cache Statistics
Enable cache statistics to monitor cache hit rates and evictions:
<distributed-cache name="sessions" owners="2" statistics="true">
    <expiration lifespan="-1"/>
</distributed-cache>
<local-cache name="realms" statistics="true">
    <memory max-count="10000"/>
</local-cache>
Then monitor these Infinispan metrics:
- Hit ratio: If the cache hit ratio for realm or user caches drops below 90%, your max-count is too low and entries are being evicted too aggressively.
- Number of entries: Track growth over time to plan capacity.
- Average read/write time: Spikes indicate network issues between cluster nodes.
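The hit-ratio rule of thumb above is simple enough to automate in an alerting script. A minimal sketch; the 90% threshold comes from the text, while the eviction cross-check is an added heuristic assumption:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio as commonly reported: hits / (hits + misses)."""
    total = hits + misses
    return hits / total if total else 0.0

def max_count_too_low(hits: int, misses: int, evictions: int,
                      threshold: float = 0.90) -> bool:
    """Heuristic: a sub-threshold hit ratio combined with nonzero
    evictions suggests the cache's max-count is undersized."""
    return cache_hit_ratio(hits, misses) < threshold and evictions > 0

print(cache_hit_ratio(9_500, 500))            # 0.95 -- healthy
print(max_count_too_low(8_000, 2_000, 1_500)) # True -- consider raising max-count
```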
Sample Grafana Dashboard Query
# Cluster size over time
vendor_cache_manager_default_cluster_size
# Login latency p95
histogram_quantile(0.95, rate(keycloak_request_duration_seconds_bucket[5m]))
# Cache hit ratio for sessions
vendor_cache_manager_default_cache_sessions_statistics_hit_ratio
For production deployments where building custom dashboards is not a priority, Skycloak’s managed hosting includes pre-configured monitoring with alerting out of the box.
Putting It All Together
These seven practices work as a system. Discovery lets nodes find each other. Caching keeps session data consistent. Database pooling handles load spikes. Sticky sessions reduce latency. Health checks route traffic correctly. Failover keeps users logged in during disruptions. Monitoring tells you when something drifts.
Here is a summary checklist to validate your cluster configuration:
- Discovery mechanism matches your deployment environment (DNS_PING or JDBC_PING)
- Distributed caches have owners="2" for session data
- Database pool sizes account for all nodes in the cluster
- Load balancer uses cookie-based session affinity
- Readiness, liveness, and startup probes are configured with appropriate thresholds
- Rolling update strategy prevents simultaneous node termination
- Prometheus metrics are enabled and dashboards track the five critical metrics
If you want to skip the infrastructure work and focus on building your application, Skycloak manages all of this for you with production-grade Keycloak clusters that follow these best practices by default. You can also explore the full Skycloak documentation to learn more about the platform’s cluster management capabilities.
Frequently Asked Questions
How many nodes should a Keycloak cluster have?
Start with three nodes for production. Two nodes provide redundancy but leave no room for a rolling update without reducing capacity. Three nodes let you take one offline for maintenance while keeping two active.
Can I mix Keycloak versions across cluster nodes?
No. All nodes in a cluster must run the same Keycloak version. Mixed versions cause serialization errors in the distributed cache because cache entry formats change between releases.
Does Keycloak support active-active multi-region clustering?
Keycloak supports cross-datacenter replication using external Infinispan with cross-site replication. This requires careful configuration of Infinispan’s relay protocol and is significantly more complex than single-datacenter clustering. For most organizations, an active-passive multi-region setup with database replication is simpler to operate.
What happens to sessions during a Keycloak upgrade?
With owners="2" and a rolling update strategy (maxUnavailable: 0), sessions survive upgrades. The new node joins the cluster, receives its cache state, and becomes ready before the old node is terminated. However, major version upgrades may require a full restart if the session serialization format changes.
Ready to simplify your authentication?
Deploy production-ready Keycloak in minutes. Unlimited users, flat pricing, no SSO tax.