Top 7 Keycloak Cluster Configuration Best Practices

Guilliano Molaire · Updated March 15, 2026 · 11 min read

Running Keycloak in a single-node setup works fine for development, but production workloads demand a clustered deployment. A misconfigured cluster leads to dropped sessions, inconsistent login states, and degraded performance under load. Getting the configuration right from the start saves you from painful debugging at 2 AM when your authentication layer stops responding.

This guide covers seven battle-tested best practices for configuring Keycloak clusters. Each practice includes concrete configuration examples you can adapt to your own infrastructure.

1. Choose the Right Discovery Mechanism

Before Keycloak nodes can form a cluster, they need to find each other. Keycloak uses JGroups for cluster communication, and the discovery mechanism you choose depends entirely on your deployment environment.

Why It Matters

If nodes cannot discover each other, you end up with multiple independent Keycloak instances instead of a coordinated cluster. Users will get redirected to different nodes mid-flow, authentication sessions will be lost, and your login pages will throw cryptic errors.

Kubernetes: Use DNS_PING

For Kubernetes deployments, DNS_PING is the recommended discovery method. It uses DNS queries against the headless service to find other pods in the cluster.

# Keycloak environment variables for Kubernetes
KC_CACHE: ispn
KC_CACHE_STACK: kubernetes
JAVA_OPTS_APPEND: >-
  -Djgroups.dns.query=keycloak-headless.keycloak-namespace.svc.cluster.local

You also need a headless service in your Kubernetes manifest:

apiVersion: v1
kind: Service
metadata:
  name: keycloak-headless
  namespace: keycloak-namespace
spec:
  clusterIP: None
  selector:
    app: keycloak
  ports:
    - port: 7800
      name: jgroups

Bare Metal or VMs: Use JDBC_PING

For traditional server deployments without Kubernetes DNS, JDBC_PING uses the existing database as a shared registry for node discovery. Each node registers itself in a database table, and other nodes query that table to find peers.

<!-- Custom Infinispan config: cache-ispn.xml -->
<jgroups>
  <stack name="jdbc-ping-tcp" extends="tcp">
    <JDBC_PING connection_driver="org.postgresql.Driver"
               connection_url="jdbc:postgresql://db-host:5432/keycloak"
               connection_username="${env.KC_DB_USERNAME}"
               connection_password="${env.KC_DB_PASSWORD}"
               initialize_sql="CREATE TABLE IF NOT EXISTS JGROUPSPING (
                 own_addr varchar(200) NOT NULL,
                 cluster_name varchar(200) NOT NULL,
                 ping_data BYTEA,
                 constraint PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name)
               )"
               info_writer_sleep_time="500"
               remove_all_data_on_view_change="true"
               stack.combine="REPLACE"
               stack.position="MPING"/>
  </stack>
</jgroups>

Then start Keycloak with:

bin/kc.sh start --cache-config-file=cache-ispn.xml

If you are deploying on Kubernetes and want to skip the manual configuration, Skycloak’s managed clusters handle discovery configuration automatically, so you can focus on your application logic instead of infrastructure plumbing.

2. Configure Infinispan Caching Properly

Keycloak uses Infinispan as its embedded caching layer. The cache configuration directly affects how your cluster handles sessions, authentication flows, and realm metadata.

Why It Matters

Incorrect cache settings cause two categories of problems. Under-provisioned caches lead to excessive database queries and slow login times. Misconfigured distributed caches cause session data to be lost when nodes leave the cluster.

Key Caches to Configure

Keycloak uses several distinct caches, each with different requirements:

<distributed-cache name="sessions" owners="2">
  <expiration lifespan="-1"/>
</distributed-cache>

<distributed-cache name="authenticationSessions" owners="2">
  <expiration lifespan="-1"/>
</distributed-cache>

<distributed-cache name="offlineSessions" owners="2">
  <expiration lifespan="-1"/>
</distributed-cache>

<local-cache name="realms">
  <encoding>
    <key media-type="application/x-java-object"/>
    <value media-type="application/x-java-object"/>
  </encoding>
  <memory max-count="10000"/>
</local-cache>

<local-cache name="users">
  <encoding>
    <key media-type="application/x-java-object"/>
    <value media-type="application/x-java-object"/>
  </encoding>
  <memory max-count="10000"/>
  <expiration max-idle="60000"/>
</local-cache>

Understanding the owners Parameter

The owners parameter controls how many copies of each cache entry exist across the cluster. Setting owners="2" means each session is stored on two nodes. If one node goes down, the session data survives on the other.

Setting owners too high increases memory usage and network traffic. Setting it to 1 means any node failure loses the sessions it held. For most production clusters, owners="2" strikes the right balance.
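To make the tradeoff concrete, here is a small Python sketch (illustrative only, not part of Keycloak) of the probability that a given session is lost when several nodes fail at once: a session disappears only if every one of its owner nodes is in the failed set.

```python
from math import comb

def session_loss_probability(n: int, owners: int, failures: int) -> float:
    """Probability that a given session is lost when `failures` of the
    `n` cluster nodes go down simultaneously. The session is lost only
    if all `owners` copies live on failed nodes (owners assumed to be
    placed uniformly at random across the cluster)."""
    if failures < owners:
        return 0.0  # at least one owner survives, so the session survives
    # Failed sets that contain all owner nodes, over all possible failed sets
    return comb(n - owners, failures - owners) / comb(n, failures)

# 3-node cluster, owners=2, one node fails: no sessions are lost
print(session_loss_probability(3, 2, 1))  # 0.0
# owners=1: a single failure loses one third of sessions on average
print(session_loss_probability(3, 1, 1))  # ~0.333
```

With `owners="2"` on three nodes, it takes two simultaneous failures to lose any sessions at all, which is the balance the text above recommends.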

Local vs. Distributed Caches

Realm and user caches are local caches because they are read-heavy and rarely change. Distributing them would add unnecessary network overhead. Session caches must be distributed because users may hit different nodes during their authentication flow.

3. Set Up Proper Database Connection Pooling

Every Keycloak node in your cluster maintains its own database connection pool. Misconfigured pooling is one of the most common causes of cluster instability under load.

Why It Matters

When a traffic spike hits, Keycloak needs database connections immediately. If the pool is exhausted, authentication requests queue up, timeouts cascade, and your users see errors. On the other hand, setting the pool too large wastes database resources and can hit connection limits on your database server.

Recommended Configuration

# Connection pool sizing
KC_DB_POOL_INITIAL_SIZE=10
KC_DB_POOL_MIN_SIZE=10
KC_DB_POOL_MAX_SIZE=50

# Connection validation (PostgreSQL JDBC driver properties; appended
# verbatim to the JDBC URL, so use ?key=value&key=value syntax)
KC_DB_URL_PROPERTIES="?connectTimeout=30&socketTimeout=30"

# Use non-XA transactions (simpler pooling; enable XA only if you need
# two-phase commit across multiple resources)
KC_TRANSACTION_XA_ENABLED=false

Sizing the Pool

A good starting point is to calculate based on your expected concurrent users:

  • Small clusters (under 1,000 concurrent users): max-size=20 per node
  • Medium clusters (1,000 to 10,000 concurrent users): max-size=50 per node
  • Large clusters (over 10,000 concurrent users): max-size=100 per node

Remember that the total connections across all nodes must not exceed your database server’s maximum connection limit. If you run 4 nodes with max-size=50, your database needs to support at least 200 connections plus overhead for maintenance queries.
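As a sanity check on that arithmetic, here is a quick sketch (a hypothetical helper, not a Keycloak API) of the minimum connection limit your database must support:

```python
def required_db_connections(nodes: int, max_pool_size: int,
                            maintenance_overhead: int = 10) -> int:
    """Minimum max_connections the database must allow: every node can
    exhaust its pool simultaneously, plus headroom for maintenance
    queries (backups, migrations, monitoring)."""
    return nodes * max_pool_size + maintenance_overhead

# 4 nodes with KC_DB_POOL_MAX_SIZE=50 -> at least 210 connections
print(required_db_connections(4, 50))  # 210
```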

Connection Validation

Always enable connection validation. Database connections go stale due to network interruptions, database restarts, or firewall timeouts. Without validation, Keycloak will attempt to use dead connections and throw errors. The Quarkus-based distribution manages its Agroal pool internally; on legacy WildFly-based distributions, configure validation in the datasource definition:

<datasource jndi-name="java:jboss/datasources/KeycloakDS"
            pool-name="KeycloakDS">
  <connection-url>jdbc:postgresql://db-host:5432/keycloak</connection-url>
  <pool>
    <min-pool-size>10</min-pool-size>
    <max-pool-size>50</max-pool-size>
  </pool>
  <validation>
    <valid-connection-checker class-name="org.jboss.jca.adapters.jdbc.extensions.postgres.PostgreSQLValidConnectionChecker"/>
    <validate-on-match>false</validate-on-match>
    <background-validation>true</background-validation>
    <background-validation-millis>10000</background-validation-millis>
  </validation>
</datasource>

4. Implement Sticky Sessions with Your Load Balancer

Sticky sessions (session affinity) route a user’s requests to the same Keycloak node throughout their authentication flow. While Keycloak’s distributed cache makes sticky sessions technically optional, they remain a strong recommendation for production clusters.

Why It Matters

Without sticky sessions, every request potentially hits a different node. That node must then fetch the session data from the distributed cache over the network. This adds latency to every request. With sticky sessions, the node that owns the session handles the request directly, reading from local memory instead of the network.

Configuring Nginx

upstream keycloak_backend {
    ip_hash;  # Basic sticky sessions based on client IP

    server keycloak-node-1:8443;
    server keycloak-node-2:8443;
    server keycloak-node-3:8443;
}

# For more reliable affinity, use cookie-based stickiness
# (the sticky directive requires NGINX Plus or a third-party module)
upstream keycloak_backend {
    server keycloak-node-1:8443;
    server keycloak-node-2:8443;
    server keycloak-node-3:8443;

    sticky cookie KC_ROUTE expires=1h domain=.example.com httponly secure;
}

Configuring HAProxy

backend keycloak_backend
    balance roundrobin
    cookie KC_ROUTE insert indirect nocache httponly secure
    server kc1 keycloak-node-1:8443 check cookie kc1 ssl verify none
    server kc2 keycloak-node-2:8443 check cookie kc2 ssl verify none
    server kc3 keycloak-node-3:8443 check cookie kc3 ssl verify none

Kubernetes Ingress

If you are running on Kubernetes with an Ingress controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keycloak-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "KC_ROUTE"
    nginx.ingress.kubernetes.io/session-cookie-expires: "3600"
    nginx.ingress.kubernetes.io/session-cookie-secure: "true"
    nginx.ingress.kubernetes.io/session-cookie-samesite: "Lax"
spec:
  rules:
    - host: auth.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keycloak
                port:
                  number: 8443

You can prototype different load balancer configurations locally using our Keycloak Docker Compose Generator before deploying to your production environment.

5. Configure Proper Health Checks

Health checks tell your load balancer and orchestrator whether a Keycloak node is ready to serve traffic. Getting these wrong means sending requests to nodes that are still starting up or are in a degraded state.

Why It Matters

Keycloak takes time to start. It needs to connect to the database, join the cluster, load realm configurations, and warm up caches. If your load balancer starts routing traffic before these steps complete, users will see errors. Conversely, if your liveness probe is too aggressive, Kubernetes may kill nodes that are temporarily slow but otherwise healthy.

Keycloak’s Built-in Health Endpoints

Enable the health endpoint when starting Keycloak:

bin/kc.sh start --health-enabled=true

This exposes:

  • /health/ready – Returns 200 when the node is ready to accept traffic
  • /health/live – Returns 200 when the node process is running
  • /health/started – Returns 200 when the node has completed startup
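
These endpoints return a MicroProfile-Health-style JSON body alongside the HTTP status code. Load balancers normally key off the status code alone, but if you script against the body, a minimal parser looks like this (the body shown is an abridged, illustrative shape):

```python
import json

def is_ready(health_body: str) -> bool:
    """Parse a Keycloak /health/ready response body (MicroProfile
    Health format: {"status": "UP"|"DOWN", "checks": [...]}) and
    report readiness. The endpoint also signals this via HTTP
    status (200 when ready, 503 when not)."""
    payload = json.loads(health_body)
    return payload.get("status") == "UP"

# Abridged body from a healthy node (checks omitted for brevity)
body = '{"status": "UP", "checks": []}'
print(is_ready(body))  # True
```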

Kubernetes Probe Configuration

containers:
  - name: keycloak
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 9000
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /health/live
        port: 9000
      initialDelaySeconds: 60
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 5
    startupProbe:
      httpGet:
        path: /health/started
        port: 9000
      initialDelaySeconds: 15
      periodSeconds: 5
      failureThreshold: 30

Key Configuration Choices

The startupProbe is critical. It gives Keycloak up to 165 seconds (15s initial delay + 30 allowed failures × 5s period) to complete its initial startup before Kubernetes considers it failed. Without a startup probe, the liveness probe may kill the container before it finishes starting.
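The startup budget arithmetic generalizes to any probe; a tiny sketch (illustrative, not a Kubernetes API):

```python
def max_startup_seconds(initial_delay: int, period: int,
                        failure_threshold: int) -> int:
    """Worst-case time Kubernetes allows before a startup probe gives
    up: the initial delay plus one probe period per allowed
    consecutive failure."""
    return initial_delay + period * failure_threshold

# The startupProbe above: 15 + 5 * 30
print(max_startup_seconds(15, 5, 30))  # 165
```

If your Keycloak nodes take longer than this window to start (large realms, slow storage), raise failureThreshold rather than periodSeconds, so a healthy node is still detected promptly.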

Set the liveness probe’s failureThreshold higher than the readiness probe. You want to remove a slow node from the load balancer quickly (readiness) but give it more time to recover before killing it (liveness).

For a deeper look at health monitoring and real-time cluster metrics, check out how Skycloak’s Insights dashboard surfaces these signals without requiring you to build custom monitoring infrastructure.

6. Plan Your Session Failover Strategy

When a node leaves the cluster, whether due to a crash, a rolling update, or scaling down, what happens to the sessions it owned? Your failover strategy determines whether users experience a seamless transition or get logged out.

Why It Matters

In a cluster with owners="2", each session exists on two nodes. If one node goes down, the backup node still has the session data. But if both owners go down simultaneously, or if you configured owners="1", those sessions are lost.

Distributed Cache with Backup Owners

This is the default and recommended approach. Configure your session caches with at least two owners:

<distributed-cache name="sessions" owners="2">
  <expiration lifespan="-1"/>
  <state-transfer timeout="300000"/>
</distributed-cache>

<distributed-cache name="clientSessions" owners="2">
  <expiration lifespan="-1"/>
  <state-transfer timeout="300000"/>
</distributed-cache>

The state-transfer timeout controls how long a joining node waits to receive its share of the distributed cache data. Set this high enough for large session stores to transfer completely.

Graceful Shutdown for Rolling Updates

When performing rolling updates, let Keycloak drain its sessions before stopping. In Kubernetes:

spec:
  terminationGracePeriodSeconds: 120
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

Setting maxUnavailable: 0 ensures a new node is fully ready before an old node is terminated. The terminationGracePeriodSeconds gives the departing node time to transfer its cache state to remaining nodes.

External Infinispan for Large Deployments

For clusters handling over 50,000 concurrent sessions, consider offloading session storage to an external Infinispan cluster:

<remote-store xmlns="urn:infinispan:config:store:remote:14.0"
              cache="sessions"
              purge="false"
              preload="false"
              shared="true"
              raw-values="true">
  <remote-server host="infinispan-server" port="11222"/>
  <security>
    <authentication>
      <digest username="keycloak" password="${env.ISPN_PASSWORD}" realm="default"/>
    </authentication>
  </security>
</remote-store>

This separates session persistence from Keycloak’s lifecycle, so you can restart or scale the Keycloak cluster without losing any sessions.

7. Monitor Cluster Health

A running cluster is not necessarily a healthy cluster. Without monitoring, you will not know about degraded nodes, cache imbalances, or growing response times until users start complaining.

Why It Matters

Cluster issues often develop gradually. A single slow database query, a network partition between two nodes, or a memory leak in one instance can cascade into a full outage if left undetected. Monitoring gives you early warning before small issues become big problems.

Enable Prometheus Metrics

bin/kc.sh start --metrics-enabled=true

This exposes a Prometheus-compatible endpoint at /metrics on the management port (default 9000). Add each node to your scrape configuration:

# Scrape config for Prometheus
- job_name: 'keycloak'
  metrics_path: /metrics
  static_configs:
    - targets:
      - keycloak-node-1:9000
      - keycloak-node-2:9000
      - keycloak-node-3:9000

Critical Metrics to Alert On

  • keycloak_request_duration_seconds (login flow response time): alert when p95 > 2 seconds
  • vendor_cache_manager_default_cluster_size (number of nodes in the cluster): alert when it falls below the expected count
  • vendor_cache_manager_default_cache_sessions_statistics_entries (active session count): alert on a sudden drop of more than 20%
  • jvm_memory_used_bytes (JVM heap usage): alert above 85% of max heap
  • db_pool_active_count (active database connections): alert above 80% of the maximum pool size

Infinispan Cache Statistics

Enable cache statistics to monitor cache hit rates and evictions:

<distributed-cache name="sessions" owners="2" statistics="true">
  <expiration lifespan="-1"/>
</distributed-cache>

<local-cache name="realms" statistics="true">
  <memory max-count="10000"/>
</local-cache>

Then monitor these Infinispan metrics:

  • Hit ratio: If the cache hit ratio for realm or user caches drops below 90%, your max-count is too low and entries are being evicted too aggressively.
  • Number of entries: Track growth over time to plan capacity.
  • Average read/write time: Spikes indicate network issues between cluster nodes.
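
The hit-ratio rule above is easy to express as an alert condition; here is a small sketch (a hypothetical helper for your own tooling, not an Infinispan API) that turns raw hit/miss counters into an alert decision:

```python
def cache_alerts(hits: int, misses: int, threshold: float = 0.90) -> dict:
    """Compute the cache hit ratio from Infinispan hit/miss counters
    and flag it when it falls below the target (90% per the guidance
    above, suggesting max-count is too low)."""
    total = hits + misses
    ratio = hits / total if total else 1.0  # an untouched cache is not an alert
    return {"hit_ratio": round(ratio, 3), "alert": ratio < threshold}

print(cache_alerts(9500, 500))   # {'hit_ratio': 0.95, 'alert': False}
print(cache_alerts(8000, 2000))  # {'hit_ratio': 0.8, 'alert': True}
```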

Sample Grafana Dashboard Query

# Cluster size over time
vendor_cache_manager_default_cluster_size

# Login latency p95
histogram_quantile(0.95, rate(keycloak_request_duration_seconds_bucket[5m]))

# Cache hit ratio for sessions
vendor_cache_manager_default_cache_sessions_statistics_hit_ratio

For production deployments where building custom dashboards is not a priority, Skycloak’s managed hosting includes pre-configured monitoring with alerting out of the box.

Putting It All Together

These seven practices work as a system. Discovery lets nodes find each other. Caching keeps session data consistent. Database pooling handles load spikes. Sticky sessions reduce latency. Health checks route traffic correctly. Failover keeps users logged in during disruptions. Monitoring tells you when something drifts.

Here is a summary checklist to validate your cluster configuration:

  1. Discovery mechanism matches your deployment environment (DNS_PING or JDBC_PING)
  2. Distributed caches have owners="2" for session data
  3. Database pool sizes account for all nodes in the cluster
  4. Load balancer uses cookie-based session affinity
  5. Readiness, liveness, and startup probes are configured with appropriate thresholds
  6. Rolling update strategy prevents simultaneous node termination
  7. Prometheus metrics are enabled and dashboards track the five critical metrics

If you want to skip the infrastructure work and focus on building your application, Skycloak manages all of this for you with production-grade Keycloak clusters that follow these best practices by default. You can also explore the full Skycloak documentation to learn more about the platform’s cluster management capabilities.

Frequently Asked Questions

How many nodes should a Keycloak cluster have?

Start with three nodes for production. Two nodes provide redundancy but leave no room for a rolling update without reducing capacity. Three nodes let you take one offline for maintenance while keeping two active.

Can I mix Keycloak versions across cluster nodes?

No. All nodes in a cluster must run the same Keycloak version. Mixed versions cause serialization errors in the distributed cache because cache entry formats change between releases.

Does Keycloak support active-active multi-region clustering?

Keycloak supports cross-datacenter replication using external Infinispan with cross-site replication. This requires careful configuration of Infinispan’s relay protocol and is significantly more complex than single-datacenter clustering. For most organizations, an active-passive multi-region setup with database replication is simpler to operate.

What happens to sessions during a Keycloak upgrade?

With owners="2" and a rolling update strategy (maxUnavailable: 0), sessions survive upgrades. The new node joins the cluster, receives its cache state, and becomes ready before the old node is terminated. However, major version upgrades may require a full restart if the session serialization format changes.

Written by Guilliano Molaire, Founder

Guilliano is the founder of Skycloak and a cloud infrastructure specialist with deep expertise in product development and scaling SaaS products. He discovered Keycloak while consulting on enterprise IAM and built Skycloak to make managed Keycloak accessible to teams of every size.

Ready to simplify your authentication?

Deploy production-ready Keycloak in minutes. Unlimited users, flat pricing, no SSO tax.

© 2026 Skycloak. All Rights Reserved. Design by Yasser Soliman