Keycloak on OpenShift (ARO): HA That Survives Audits

Last updated: June 2026

TL;DR

Keycloak on OpenShift is easy to demo and surprisingly easy to run badly. Here is the production checklist, compressed:

Use the Operator, not Helm. The official Keycloak Operator (and its supported twin, the Red Hat build of Keycloak on OperatorHub) owns upgrades, TLS wiring, and pod topology. Community Helm charts lag releases and leave day-2 to you.
HA got easier in the 26 line. Keycloak 26 made JDBC_PING the default node discovery and persistent user sessions the default storage model, so a restart no longer logs everyone out and clustering works without headless-service archaeology.
Your database is your real SLA. Run external PostgreSQL (on ARO that means Azure Database for PostgreSQL Flexible Server with zone-redundant HA). The embedded dev database in production is a resume-generating event.
Pick your TLS route mode on purpose. Edge, re-encrypt, or passthrough each change who terminates TLS and what an auditor sees. Re-encrypt is the enterprise default for a reason.
Size with the official formula, not vibes. 1 vCPU per 15 password logins per second, 1 vCPU per 120 client-credential grants per second, plus 150 percent headroom and roughly 1250 MB of RAM per pod. Numbers below.

We operate HA Keycloak clusters for a living at Skycloak, so everything here is the configuration we actually run, not a lab demo.

Operator, Helm, or raw manifests?

Use the Operator. On OpenShift specifically there are two flavors of the same answer: the upstream Keycloak Operator if you track community releases (currently the 26.6 line, with 26.6.3 released June 2026), or the Red Hat build of Keycloak (RHBK) Operator from OperatorHub if you want Red Hat support attached, which is usually the point of paying for ARO in the first place. RHBK replaced the old RH-SSO product line and tracks upstream Keycloak closely, so the configuration surface in this post applies to both.

What the Operator buys you over Helm or hand-rolled manifests:

A Keycloak custom resource that declares replicas, database, hostname, and TLS in one reviewable object instead of forty templated YAML files.
Managed rolling upgrades that respect Keycloak’s upgrade ordering, which matters because identity is the one workload where “just bounce it and see” is not a strategy.
Sane defaults for probes, pod labels, and service wiring that match what the Keycloak team load-tests against.

The honest caveat: the Operator is opinionated. If your platform team needs exotic init containers or sidecar meshes, you will occasionally fight it. Fight it anyway. The exotic setups are where outages live.

What the declarative side looks like in practice, trimmed to the load-bearing fields:

apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
  namespace: sso
spec:
  instances: 3
  db:
    vendor: postgres
    host: psql-flex.example.postgres.database.azure.com
    usernameSecret:
      name: kc-db
      key: username
    passwordSecret:
      name: kc-db
      key: password
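  hostname:
    # With the v2 hostname options, a separate admin address is given as a full URL,
    # and the main hostname must then be a full URL as well.
    hostname: https://login.example.com
    admin: https://admin-login.internal.example.com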
  proxy:
    headers: xforwarded
  http:
    tlsSecret: keycloak-tls

Three replicas, external Azure Postgres, split admin hostname, forwarded headers, and TLS to the pod. Everything else in this post hangs off those few lines.
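The custom resource above reads its database credentials from a Secret named kc-db with username and password keys. A minimal sketch of that Secret is below. The values are placeholders, and a real deployment would more likely sync them from a vault than commit them to Git.

```yaml
# Sketch only: placeholder credentials for the Secret referenced by
# spec.db.usernameSecret and spec.db.passwordSecret in the Keycloak CR.
apiVersion: v1
kind: Secret
metadata:
  name: kc-db
  namespace: sso
type: Opaque
stringData:
  username: keycloak     # placeholder value
  password: change-me    # placeholder value
```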

What does HA actually require in the Keycloak 26 line?

Less than the blog posts from 2022 told you, and this is worth internalizing because a lot of stale advice still circulates. Two changes in the Keycloak 26 line did the heavy lifting, per the official release notes (persistent sessions arrived as the default in 26.0, jdbc-ping discovery in 26.1):

JDBC_PING became the default discovery mechanism. Cluster members find each other through the shared database instead of multicast or custom headless-service DNS. On OpenShift this removes an entire class of “nodes formed two separate clusters” incidents.
Persistent user sessions became the default. Sessions are written to the database rather than living only in the embedded Infinispan caches. A full cluster restart no longer signs everyone out, and losing a pod no longer torches the sessions it held.

That leaves a short, concrete HA checklist:

Three replicas minimum spread across zones with topology spread constraints. ARO clusters span three availability zones in supported regions; use them.
A PodDisruptionBudget so node drains during cluster upgrades never take two Keycloak pods at once.
External PostgreSQL with its own HA story. On ARO, Azure Database for PostgreSQL Flexible Server with zone-redundant HA is the boring, correct choice. Since both discovery and sessions now live in the database, Postgres availability is Keycloak availability. Treat its failover testing as part of your IdP’s runbook, not the DBA team’s problem.
Embedded Infinispan still caches hot data, and at large realm counts the cache and heap need attention. We published our measurements on that in how many realms one Keycloak cluster can handle, and our Redis cache benchmark covers the external-cache variant.
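The first two checklist items can be sketched as manifests. Treat both as assumptions to verify against your own cluster: the pod labels (app and app.kubernetes.io/instance) are what we expect the Operator to set, so check them with oc get pods --show-labels, and the spec.scheduling stanza assumes a recent Operator release. On older releases the same constraint would have to go through the unsupported pod template instead. The PodDisruptionBudget name is hypothetical.

```yaml
# PodDisruptionBudget: voluntary disruptions (node drains, cluster upgrades)
# may take at most one Keycloak pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: keycloak-pdb                        # hypothetical name
  namespace: sso
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: keycloak                         # assumed Operator pod label
      app.kubernetes.io/instance: keycloak  # assumed: matches the CR name
---
# Fragment to merge into the existing Keycloak CR, not a complete resource:
# ask the scheduler to keep the three pods evenly spread across zones.
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
  namespace: sso
spec:
  instances: 3
  scheduling:                               # assumed field, see note above
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: keycloak
            app.kubernetes.io/instance: keycloak
```

With maxUnavailable: 1 and three replicas, a drain always leaves at least two pods serving. Switching whenUnsatisfiable to ScheduleAnyway trades strict spreading for easier scheduling when a zone is short on capacity.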

Do you still need sticky sessions? Mostly no, with persistent sessions on. Cookie affinity still smooths the login flow itself, and edge and re-encrypt Routes set an affinity cookie by default, so leave it on and move on. Passthrough Routes cannot see cookies and fall back to source-IP balancing.

TLS on ARO: edge, re-encrypt, or passthrough?

OpenShift Routes give you three termination modes, and the one you pick decides what your security review looks like:

Mode	TLS terminates at	Traffic to the pod	When to pick it
Edge	The router	Plain HTTP inside the cluster	Dev clusters, or networks where the SDN is considered trusted. Auditors increasingly hate it.
Re-encrypt	The router, then re-established	TLS with the cluster-internal cert	The enterprise default. Wildcard cert at the edge, internal cert to the pod, every hop encrypted.
Passthrough	The Keycloak pod itself	TLS end to end, untouched	When compliance requires the IdP to terminate its own TLS, or for mTLS client-certificate flows.
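For reference, a re-encrypt Route might look like the sketch below. It is illustrative rather than what the Operator generates. The Route name is hypothetical, and the Service name (keycloak-service) and port name (https) are assumptions about what the Operator creates for a CR named keycloak, so confirm them with oc get svc -n sso. If you manage the Route yourself, you would also need to turn off the Operator's own ingress handling.

```yaml
# Sketch of a re-encrypt Route: TLS terminates at the router and is
# re-established to the pod, so every hop is encrypted.
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: keycloak-login            # hypothetical name
  namespace: sso
spec:
  host: login.example.com
  to:
    kind: Service
    name: keycloak-service        # assumed Service name, verify in your cluster
  port:
    targetPort: https             # assumed port name on that Service
  tls:
    termination: reencrypt
    insecureEdgeTerminationPolicy: Redirect
    # certificate / key: the public certificate the router serves for this host
    # (omit both to fall back to the router's default wildcard certificate).
    # destinationCACertificate: the CA that signed the pod certificate, so the
    # router can verify the internal hop.
```

Changing termination to edge or passthrough gives the other two rows of the table. Passthrough takes no certificate fields, because the router never sees inside the TLS session.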

Practical ARO notes that save a ticket later:

The default *.apps.<cluster>.<region>.aroapp.io wildcard certificate is fine for proving the install works and wrong for production. Put Keycloak on a real hostname (KC_HOSTNAME=login.example.com) with a certificate from the cert-manager Operator wired to your CA or Azure DNS for ACME.
Behind any route except passthrough, set KC_PROXY_HEADERS=xforwarded so Keycloak builds correct redirect URIs. The symptom of forgetting is OIDC redirects pointing at internal service names, and the bug report will come from your angriest app team.
Split the admin console onto its own hostname with KC_HOSTNAME_ADMIN and restrict it at the network level. Exposing /admin on the same public route as login is the single most common finding we see in reviews of self-hosted Keycloak.
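A cert-manager Certificate for the pod-side TLS secret could be sketched like this. The issuer name and kind are hypothetical placeholders for whatever issuer your platform team has configured. The secret name matches the http.tlsSecret value in the Keycloak custom resource earlier in this post.

```yaml
# Sketch: have cert-manager issue and renew the certificate Keycloak serves.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: keycloak-tls
  namespace: sso
spec:
  secretName: keycloak-tls        # the Secret named by http.tlsSecret in the CR
  dnsNames:
    - login.example.com
    - admin-login.internal.example.com
  issuerRef:
    kind: ClusterIssuer
    name: corporate-ca            # hypothetical issuer name
  duration: 2160h                 # 90 days
  renewBefore: 360h               # renew 15 days before expiry
```

Renewal is then cert-manager's job rather than a calendar reminder, which also answers the "who rotates them" question in an audit.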

How big should the pods be? Use the official formula

Not invented tiers, not “medium feels right.” The Keycloak team publishes load-tested numbers in Concepts for sizing CPU and memory resources, and they are refreshingly specific:

1 vCPU per 15 password-based logins per second
1 vCPU per 120 client-credential grants per second
1 vCPU per 120 refresh token requests per second
Add 150 percent CPU headroom for spikes and node-failure failover, which means provisioning 2.5 times the steady-state figure
Roughly 1250 MB of RAM per pod as a baseline, with heap around 70 percent of the container limit and about 300 MB of non-heap overhead

Notice what is not in that list: registered user count. Keycloak sizing is driven by request rate, not by how many rows sit in the user table. A 2-million-user realm that peaks at 20 logins per second needs less CPU than a 50,000-user realm doing 100 logins per second during a shift change.

Worked example: a workforce estate peaking at 45 password logins per second with 60 refresh requests per second needs 3 vCPU for logins (45 / 15) plus 0.5 vCPU for refreshes (60 / 120), so 3.5 vCPU of steady-state capacity. With 150 percent headroom that is 3.5 × 2.5 = 8.75, so roughly 9 vCPU across the cluster, which lands comfortably on three pods of 3 vCPU and 1.5 GB each across three zones. The math takes five minutes and removes the entire “is it sized right” argument from your architecture review.
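Written out, the worked example is just the three published ratios plus the headroom multiplier. The rate symbols are our own notation, not the sizing guide's, and the client-credential rate is assumed to be zero because the example does not mention one.

```latex
\begin{align*}
\text{vCPU}_{\text{steady}}
  &= \frac{r_{\text{login}}}{15} + \frac{r_{\text{cc}}}{120} + \frac{r_{\text{refresh}}}{120} \\
  &= \frac{45}{15} + \frac{0}{120} + \frac{60}{120} = 3 + 0 + 0.5 = 3.5 \\
\text{vCPU}_{\text{cluster}}
  &= (1 + 1.5) \times \text{vCPU}_{\text{steady}} = 2.5 \times 3.5 = 8.75 \approx 9 \\
\text{vCPU}_{\text{pod}}
  &= \frac{9}{3} = 3
\end{align*}
```

The first line also explains the earlier comparison: 20 logins per second is about 1.3 vCPU of steady-state capacity, while 100 logins per second is about 6.7 vCPU, whatever the size of the user table.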

The audit checklist

The difference between a lab Keycloak and one that survives an enterprise audit is a short list, but every line on it is load-bearing:

Event logging shipped off the box. Enable login and admin events, set retention, and ship them to your SIEM. Auditors ask for failed-login evidence by date range, and “the pod restarted, the events are gone” is not an answer. Persistent events in the database plus log forwarding covers it.
Admin surface locked down. Separate admin hostname, network policy restricting it to the management network, no permanent super-admin accounts in the master realm, and admin sessions on short timeouts.
TLS posture documented. Which route mode, which certificates, who rotates them, and what the internal hop looks like. Re-encrypt with cert-manager-managed certificates makes this paragraph short.
Backup and restore actually rehearsed. The database backup is the Keycloak backup. What auditors want to see is a restore drill with a timestamp, not a screenshot of the backup policy.
FIPS, if your sector requires it. Keycloak supports running in FIPS-140 mode with the BouncyCastle FIPS provider. It is genuinely supported and genuinely fiddly, so budget real time if this applies to you.
Upgrade cadence with evidence. Keycloak ships minors and patches continuously, and the 26.5 line already reached end of life in April 2026 per endoflife.date. An IdP two minor versions behind is an audit finding waiting to be written.

What running this yourself actually costs

Time for the honest section. Everything above is achievable with the Operator and a competent platform team. What the architecture diagram does not show is the recurring cost: tracking a monthly-ish release train, rehearsing database failovers, owning the 3 a.m. page when login breaks, rotating certificates, and re-validating the whole stack every OpenShift upgrade. In practice that is a meaningful slice of a platform engineer, permanently, for a single piece of infrastructure that every other system depends on.

That is the trade Skycloak exists for: we run the HA Keycloak, the upgrades, the monitoring, and the on-call, and your team consumes identity as a service while keeping all of Keycloak’s flexibility. If you would rather see the numbers than the pitch, our pricing is public, and if you are mid-design on an ARO deployment, talk to us and we will happily review your architecture against this checklist.

Frequently asked questions

How do I deploy Keycloak on OpenShift?

Use the Keycloak Operator from OperatorHub, either the upstream community build or the Red Hat build of Keycloak for supported environments. Declare replicas, database, hostname, and TLS in the Keycloak custom resource. Avoid community Helm charts for production, since they lag releases and leave upgrades entirely to you.

Is Keycloak production-ready on Kubernetes and ARO?

Yes, and the 26 line made it markedly easier: JDBC_PING discovery and persistent user sessions are defaults, so clustering and restarts behave sanely out of the box. Production readiness then hinges on external HA PostgreSQL, three zone-spread replicas, deliberate TLS termination, and request-rate-based sizing.

How many replicas does Keycloak need for high availability?

Three, spread across availability zones with a PodDisruptionBudget. Two replicas survive a pod failure but leave you running without redundancy during every upgrade or node drain. Add capacity beyond three based on the official sizing formula of 1 vCPU per 15 password logins per second plus 150 percent headroom.

How much CPU and memory does Keycloak need?

Size by request rate, not user count: 1 vCPU per 15 password logins per second, 1 vCPU per 120 client-credential or refresh grants per second, 150 percent headroom on top, and roughly 1250 MB of RAM per pod with heap at about 70 percent of the container limit, per the official Keycloak sizing guide.

Rather skip the YAML and the on-call entirely? Try Skycloak free and get an HA Keycloak cluster that already passes this checklist.