Monitoring Keycloak with Prometheus and Grafana
Your identity provider handles every authentication event in your stack. When it slows down or fails, users cannot log in, APIs reject valid tokens, and support tickets pile up. Despite this, many teams treat their Keycloak deployment as a black box — they know it is running, but they have no visibility into how well it is performing or where bottlenecks are forming.
Keycloak ships with built-in metrics support through Micrometer and exposes a Prometheus-compatible endpoint out of the box. Combined with Grafana for visualization, you can build a comprehensive observability layer for your identity infrastructure without installing third-party plugins.
This guide walks through enabling Keycloak’s metrics endpoint, configuring Prometheus to scrape it, identifying the KPIs that matter for authentication systems, and building Grafana dashboards to monitor them. If you are new to Keycloak, start with the Skycloak documentation for an overview of core concepts.
Enabling the Keycloak Metrics Endpoint
Keycloak’s metrics endpoint exposes authentication throughput, latency percentiles, error rates, and infrastructure health in Prometheus format — making it the single most important data source for monitoring an identity system.
Keycloak’s metrics are disabled by default. To enable them, pass the --metrics-enabled=true flag at startup. The metrics are served on the /metrics path of the management interface and use the application/openmetrics-text content type, which Prometheus can scrape natively.
Here is a production-ready startup configuration that enables metrics along with histogram data for HTTP requests and event tracking:
```shell
bin/kc.sh start \
  --metrics-enabled=true \
  --http-metrics-histograms-enabled=true \
  --http-metrics-slos=250,500,1000,2500 \
  --event-metrics-user-enabled=true \
  --event-metrics-user-tags=realm,idp,clientId \
  --cache-metrics-histograms-enabled=true
```
If you run Keycloak in a container, set these as environment variables. You can generate a complete Docker Compose file with the Skycloak Docker Compose Generator:
```yaml
# docker-compose.yml (excerpt)
services:
  keycloak:
    image: quay.io/keycloak/keycloak:26.2
    command: start
    environment:
      KC_METRICS_ENABLED: "true"
      KC_HTTP_METRICS_HISTOGRAMS_ENABLED: "true"
      KC_HTTP_METRICS_SLOS: "250,500,1000,2500"
      KC_EVENT_METRICS_USER_ENABLED: "true"
      KC_EVENT_METRICS_USER_TAGS: "realm,idp,clientId"
      KC_CACHE_METRICS_HISTOGRAMS_ENABLED: "true"
```
A few details worth noting:
- `http-metrics-slos` defines latency buckets in milliseconds. The values above create histogram buckets at 250ms, 500ms, 1s, and 2.5s, which are useful for defining service level objectives later.
- `event-metrics-user-tags` controls which dimensions are added to event metrics. Adding `clientId` lets you break down authentication events per application, but avoid this on deployments with hundreds of clients, as it increases metric cardinality significantly.
- `cache-metrics-histograms-enabled` exposes Infinispan cache performance data, which is critical for diagnosing session and token lookup latency.
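To see why the `clientId` tag matters, a back-of-the-envelope estimate helps: each unique label combination becomes its own Prometheus time series, so tags multiply. All numbers below are illustrative, not measurements from a real deployment.

```python
# Rough series-count estimate for keycloak_user_events_total.
# Every distinct label combination is a separate time series.
event_types = 20   # LOGIN, LOGOUT, REFRESH_TOKEN, error variants, ...
realms = 5
idps = 3
clients = 300      # a large multi-tenant deployment

without_client_id = event_types * realms * idps
with_client_id = without_client_id * clients

print(f"series without clientId tag: {without_client_id}")  # 300
print(f"series with clientId tag:    {with_client_id}")     # 90000
```

Three hundred series is trivial for Prometheus; ninety thousand from a single metric family starts to hurt scrape time and memory.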
Once Keycloak is running, verify the endpoint is active:
```shell
curl -s http://localhost:9000/metrics | head -20
```
The management interface listens on port 9000 by default. You should see a stream of OpenMetrics-formatted data.
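If you want to verify the scrape programmatically, for example in a deployment smoke test, a small parser over the OpenMetrics text format is enough. This is a sketch against a hand-written sample payload; real output from the management endpoint contains far more metric families.

```python
# Sample OpenMetrics payload (illustrative; real /metrics output is larger).
sample = """\
# TYPE keycloak_user_events_total counter
keycloak_user_events_total{realm="master",event="LOGIN"} 42.0
# TYPE agroal_active_count gauge
agroal_active_count{datasource="default"} 3.0
"""

def metric_families(payload: str) -> set[str]:
    """Return the set of metric names present in an OpenMetrics payload."""
    names = set()
    for line in payload.splitlines():
        if line and not line.startswith("#"):
            # Metric name ends at the label block "{" or at the first space.
            names.add(line.split("{")[0].split(" ")[0])
    return names

found = metric_families(sample)
assert {"keycloak_user_events_total", "agroal_active_count"} <= found
```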
Key Keycloak Metrics to Track
Keycloak exposes three categories of metrics: Keycloak-specific counters for authentication events, standard HTTP server metrics from the underlying Quarkus runtime, and infrastructure metrics for database connections and caches.
The following table lists the most important metrics for monitoring authentication performance:
| Metric Name | Type | Description |
|---|---|---|
| `keycloak_user_events_total` | Counter | Total user events (logins, logouts, token exchanges) segmented by realm, client, IDP, event type, and error |
| `keycloak_credentials_password_hashing_validations_total` | Counter | Password hash validation attempts with outcome labels (valid, invalid, error) and algorithm details |
| `http_server_requests_seconds_count` | Counter | Total HTTP requests processed, labeled by method, status, URI pattern, and outcome |
| `http_server_requests_seconds_sum` | Counter | Cumulative request duration in seconds across all HTTP requests |
| `http_server_requests_seconds_bucket` | Histogram | Request duration distribution across configured SLO buckets |
| `http_server_active_requests` | Gauge | Current number of in-flight HTTP requests |
| `http_server_bytes_written_sum` | Counter | Total bytes sent in HTTP responses |
| `http_server_bytes_read_sum` | Counter | Total bytes received in HTTP requests |
| `agroal_active_count` | Gauge | Database connections currently in use |
| `agroal_available_count` | Gauge | Idle database connections in the pool |
| `agroal_awaiting_count` | Gauge | Threads waiting for a database connection |
These metrics combine to give you a complete picture of authentication throughput, latency, error rates, and resource utilization. You can inspect individual tokens issued by Keycloak using the JWT Token Analyzer to verify that claims and scopes are correct.
Configuring Prometheus to Scrape Keycloak
With metrics enabled, configure Prometheus to scrape the Keycloak management endpoint. Add a scrape job to your prometheus.yml:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "keycloak"
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ["keycloak:9000"]
        labels:
          environment: "production"
          service: "identity"
    # Optional: reduce cardinality by dropping high-churn labels
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "jvm_buffer_.*"
        action: drop
```
For Kubernetes deployments using the Prometheus Operator, create a ServiceMonitor instead:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-metrics
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - keycloak
  selector:
    matchLabels:
      app: keycloak
  endpoints:
    - port: management
      path: /metrics
      interval: 15s
```
Ensure your Prometheus instance has the release: prometheus label selector configured (or whatever label your Prometheus Operator watches for) so it picks up the ServiceMonitor.
Essential PromQL Queries for Authentication KPIs
Raw metrics become useful when you transform them into KPIs. The following PromQL queries cover the authentication metrics that matter most for operations teams.
Authentication Request Rate
Track how many authentication requests Keycloak is handling per second, filtered to login and token endpoints:
```promql
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*"
}[5m]))
```
Login Success and Failure Rates
Use the keycloak_user_events_total counter to measure login outcomes. The event label distinguishes between successful logins and various failure types:
```promql
# Successful logins per minute
sum(rate(keycloak_user_events_total{event="LOGIN"}[5m])) * 60

# Failed logins per minute
sum(rate(keycloak_user_events_total{event="LOGIN_ERROR"}[5m])) * 60

# Login failure ratio
sum(rate(keycloak_user_events_total{event="LOGIN_ERROR"}[5m]))
/
(sum(rate(keycloak_user_events_total{event="LOGIN"}[5m]))
 + sum(rate(keycloak_user_events_total{event="LOGIN_ERROR"}[5m])))
```
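The failure-ratio arithmetic is simple enough to sanity-check by hand. A quick sketch with illustrative rates (events per second over a 5-minute window):

```python
# Login failure ratio, computed the same way as the PromQL above:
# failures / (successes + failures). Values are illustrative.
login_rate = 12.0        # rate of event="LOGIN"
login_error_rate = 0.5   # rate of event="LOGIN_ERROR"

failure_ratio = login_error_rate / (login_rate + login_error_rate)
print(f"failure ratio: {failure_ratio:.3f}")  # 0.5 / 12.5 = 0.040
```

Note that the denominator includes both outcomes; dividing failures by successes alone would overstate the ratio as failures climb.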
Authentication Latency (P95)
If you enabled histogram buckets with --http-metrics-slos, you can calculate percentile latency for authentication endpoints:
```promql
histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket{
    uri=~"/realms/.*/(protocol|login-actions)/.*"
  }[5m])) by (le)
)
```
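Under the hood, `histogram_quantile` finds the cumulative bucket containing the requested rank and interpolates linearly within it. A minimal Python sketch of that calculation, using the SLO bucket bounds configured earlier and illustrative cumulative counts:

```python
# Cumulative histogram buckets: (le upper bound in seconds, cumulative count).
# Bounds mirror --http-metrics-slos=250,500,1000,2500; counts are illustrative.
buckets = [
    (0.25, 900),
    (0.5, 970),
    (1.0, 990),
    (2.5, 998),
    (float("inf"), 1000),
]

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if rank <= count:
            if upper_bound == float("inf"):
                return lower_bound  # cap at the highest finite bound
            # Linear interpolation within the bucket.
            span = count - lower_count
            frac = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, count
    return lower_bound

p95 = histogram_quantile(0.95, buckets)
print(f"P95 latency: {p95:.3f}s")  # rank 950 falls in the 0.25-0.5s bucket
```

This also illustrates why bucket choice matters: the estimate can only be as precise as the bucket it lands in, so put bounds near your SLO thresholds.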
SLO Compliance: Percentage of Requests Under 250ms
This query directly measures what percentage of authentication requests meet a 250ms latency target, which aligns with Keycloak’s own recommended SLI approach:
```promql
sum(rate(http_server_requests_seconds_bucket{
  uri=~"/realms/.*/(protocol|login-actions)/.*",
  le="0.25"
}[5m]))
/
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*"
}[5m]))
```
Server Error Rate
Track the proportion of authentication requests that result in 5xx errors:
```promql
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*",
  outcome="SERVER_ERROR"
}[5m]))
/
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*"
}[5m]))
```
Database Connection Pool Saturation
A saturated connection pool is one of the most common causes of Keycloak latency spikes. Monitor the ratio of active connections to total pool capacity:
```promql
# Active connections as a percentage of total pool
agroal_active_count / (agroal_active_count + agroal_available_count)

# Threads waiting for connections (should be 0 under normal load)
agroal_awaiting_count
```
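The same saturation logic is easy to mirror in a health check. A sketch with illustrative gauge values (the 80% warning threshold is an assumption for the example, not a Keycloak recommendation):

```python
# Connection-pool saturation check mirroring the PromQL queries above.
active = 18      # agroal_active_count
available = 2    # agroal_available_count
awaiting = 0     # agroal_awaiting_count

saturation = active / (active + available)
print(f"pool saturation: {saturation:.0%}")  # 18 / 20 = 90%

# Warn before threads start queueing: any awaiting > 0 means requests
# are already blocked on the database pool.
if saturation > 0.8 or awaiting > 0:
    print("warning: connection pool under pressure")
```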
Password Hashing Performance
Password hashing is CPU-intensive and often the bottleneck in login flows. Track validation rates by outcome:
```promql
sum(rate(keycloak_credentials_password_hashing_validations_total[5m])) by (outcome)
```
Building Grafana Dashboards
The Keycloak project maintains official Grafana dashboards in the keycloak-grafana-dashboard GitHub repository. Two dashboards are available:
- **Keycloak Troubleshooting Dashboard** (`keycloak-troubleshooting-dashboard.json`) — displays the service level indicators described above, with drill-down panels for investigating latency and error spikes.
- **Keycloak Capacity Planning Dashboard** (`keycloak-capacity-planning-dashboard.json`) — tracks password validations, login flow volumes, and resource utilization to help with scaling decisions. This dashboard requires event metrics to be enabled.
To import them, download the JSON files from the repository (use the 26.2.0 branch for Keycloak 26.1-26.2, or main for 26.3 and later), then import via Grafana’s dashboard UI.
Custom Dashboard Panels
Beyond the official dashboards, you will likely want panels tailored to your specific SLOs. Here is a Grafana panel configuration for an authentication latency heatmap:
```json
{
  "title": "Authentication Latency Distribution",
  "type": "heatmap",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(increase(http_server_requests_seconds_bucket{uri=~\"/realms/.*/(protocol|login-actions)/.*\"}[1m])) by (le)",
      "format": "heatmap",
      "legendFormat": "{{le}}"
    }
  ],
  "yAxis": {
    "format": "s",
    "label": "Latency"
  },
  "options": {
    "calculate": false,
    "color": {
      "scheme": "Spectral",
      "reverse": true
    }
  }
}
```
And a stat panel for real-time SLO compliance:
```json
{
  "title": "Auth Latency SLO (target: 95% < 250ms)",
  "type": "stat",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_server_requests_seconds_bucket{uri=~\"/realms/.*/(protocol|login-actions)/.*\", le=\"0.25\"}[30d])) / sum(rate(http_server_requests_seconds_count{uri=~\"/realms/.*/(protocol|login-actions)/.*\"}[30d]))",
      "legendFormat": "SLO Compliance"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "color": "red", "value": 0 },
          { "color": "yellow", "value": 0.90 },
          { "color": "green", "value": 0.95 }
        ]
      }
    }
  }
}
```
Recommended Dashboard Layout
Organize your Keycloak monitoring dashboard into four rows:
- Overview — authentication request rate, active sessions gauge, SLO compliance percentage, current error rate
- Latency — P50/P95/P99 time series, latency heatmap, slowest endpoints table
- Authentication Events — login success/failure rates, login failure ratio, events broken down by realm and client
- Infrastructure — database connection pool usage, active HTTP connections, JVM heap utilization, password hashing rate
Defining Meaningful SLOs for Identity Systems
Having metrics and dashboards is only half the picture. To make them actionable, define service level objectives that reflect what your users actually experience.
A service level objective (SLO) for an identity system defines the minimum acceptable performance for authentication operations, measured over a rolling window (typically 30 days), and expressed as a percentage of requests meeting a latency or availability target.
Keycloak’s own SLO documentation suggests three SLIs as a starting point:
- Availability: Keycloak should be available 99.9% of the time within a rolling 30-day window (roughly 44 minutes of downtime per month).
- Latency: 95% of authentication-related requests should complete in under 250ms over a 30-day period.
- Error rate: Server-side errors on authentication endpoints should remain below 0.1%.
These are reasonable defaults, but adjust them based on your application’s requirements. A consumer-facing authentication flow may need tighter latency targets than an internal workforce login portal.
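The availability target is straightforward error-budget arithmetic. Over a strict 30-day window, 99.9% availability leaves about 43.2 minutes of allowed downtime; the commonly quoted "roughly 44 minutes" corresponds to an average calendar month, which is slightly longer than 30 days.

```python
# Error-budget arithmetic behind a 99.9% availability SLO
# over a rolling 30-day window.
availability_target = 0.999
window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days

downtime_budget = window_minutes * (1 - availability_target)
print(f"downtime budget: {downtime_budget:.1f} minutes per window")  # 43.2
```

Tracking spend against this budget (rather than reacting to each blip) tells you whether you can afford risky changes like version upgrades this month.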
Set up Prometheus alerting rules to notify your team before SLOs are breached:
```yaml
# alerts.yml
groups:
  - name: keycloak-slos
    rules:
      - alert: KeycloakHighLoginLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_requests_seconds_bucket{
              uri=~"/realms/.*/(protocol|login-actions)/.*"
            }[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Keycloak P95 login latency exceeds 500ms"

      - alert: KeycloakHighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{
            uri=~"/realms/.*/(protocol|login-actions)/.*",
            outcome="SERVER_ERROR"
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            uri=~"/realms/.*/(protocol|login-actions)/.*"
          }[5m])) > 0.001
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Keycloak server error rate exceeds 0.1%"

      - alert: KeycloakDBPoolExhausted
        expr: agroal_awaiting_count > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Threads are waiting for database connections"
```
Reducing the Operational Burden
Setting up Prometheus, Grafana, alerting rules, and dashboard maintenance is significant operational overhead. For every Keycloak version upgrade, you need to verify metric compatibility, update dashboard JSON, and validate that alert thresholds still make sense.
If you would rather focus on your application than on maintaining identity infrastructure monitoring, Skycloak provides built-in observability through its Insights feature. Authentication metrics, login analytics, and performance data are available out of the box with no Prometheus configuration or Grafana maintenance required. The monitoring layer is maintained alongside the Keycloak deployment itself, so metric compatibility is never a concern during upgrades. You can also forward Keycloak events to your SIEM via HTTP webhook for security-focused event streaming alongside metrics.
Next Steps
Start with the basics: enable --metrics-enabled=true on your Keycloak deployment and verify the /metrics endpoint responds. Add the Prometheus scrape configuration and import the official Grafana dashboards. Once you have baseline visibility, layer on the custom PromQL queries and alerting rules that match your SLO targets.
The queries and configurations in this guide give you a foundation, but every deployment is different. Pay attention to which metrics are most useful for your specific traffic patterns and tune your dashboards and alert thresholds accordingly.
Ready to skip the monitoring setup entirely? Explore Skycloak’s managed Keycloak hosting with built-in observability.