Monitoring Keycloak with Prometheus and Grafana
Your identity provider handles every authentication event in your stack. When it slows down or fails, users cannot log in, APIs reject valid tokens, and support tickets pile up. Despite this, many teams treat their Keycloak deployment as a black box — they know it is running, but they have no visibility into how well it is performing or where bottlenecks are forming.
Keycloak ships with built-in metrics support through Micrometer and exposes a Prometheus-compatible endpoint out of the box. Combined with Grafana for visualization, you can build a comprehensive observability layer for your identity infrastructure without installing third-party plugins.
This guide walks through enabling Keycloak’s metrics endpoint, configuring Prometheus to scrape it, identifying the KPIs that matter for authentication systems, and building Grafana dashboards to monitor them. If you are new to Keycloak, start with the Skycloak documentation for an overview of core concepts.
Enabling the Keycloak Metrics Endpoint
Keycloak’s metrics endpoint exposes authentication throughput, latency percentiles, error rates, and infrastructure health in Prometheus format — making it the single most important data source for monitoring an identity system.
Keycloak’s metrics are disabled by default. To enable them, pass the --metrics-enabled=true flag at startup. The metrics are served on the /metrics path of the management interface and use the application/openmetrics-text content type, which Prometheus can scrape natively.
Here is a production-ready startup configuration that enables metrics along with histogram data for HTTP requests and event tracking:
```shell
bin/kc.sh start \
  --metrics-enabled=true \
  --http-metrics-histograms-enabled=true \
  --http-metrics-slos=250,500,1000,2500 \
  --event-metrics-user-enabled=true \
  --event-metrics-user-tags=realm,idp,clientId \
  --cache-metrics-histograms-enabled=true
```
If you run Keycloak in a container, set these as environment variables. You can generate a complete Docker Compose file with the Skycloak Docker Compose Generator:
```yaml
# docker-compose.yml (excerpt)
services:
  keycloak:
    image: quay.io/keycloak/keycloak:26.2
    command: start
    environment:
      KC_METRICS_ENABLED: "true"
      KC_HTTP_METRICS_HISTOGRAMS_ENABLED: "true"
      KC_HTTP_METRICS_SLOS: "250,500,1000,2500"
      KC_EVENT_METRICS_USER_ENABLED: "true"
      KC_EVENT_METRICS_USER_TAGS: "realm,idp,clientId"
      KC_CACHE_METRICS_HISTOGRAMS_ENABLED: "true"
```
A few details worth noting:
- `http-metrics-slos` defines latency buckets in milliseconds. The values above create histogram buckets at 250ms, 500ms, 1s, and 2.5s, which are useful for defining service level objectives later.
- `event-metrics-user-tags` controls which dimensions are added to event metrics. Adding `clientId` lets you break down authentication events per application, but avoid this on deployments with hundreds of clients, as it increases metric cardinality significantly.
- `cache-metrics-histograms-enabled` exposes Infinispan cache performance data, which is critical for diagnosing session and token lookup latency.
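To see why the `clientId` tag matters, a back-of-the-envelope estimate helps: each unique label combination becomes its own Prometheus time series, so tags multiply. All numbers below are illustrative, not measurements from a real deployment.

```python
# Rough series-count estimate for keycloak_user_events_total.
# Every distinct label combination is a separate time series.
event_types = 20   # LOGIN, LOGOUT, REFRESH_TOKEN, error variants, ...
realms = 5
idps = 3
clients = 300      # a large multi-tenant deployment

without_client_id = event_types * realms * idps
with_client_id = without_client_id * clients

print(f"series without clientId tag: {without_client_id}")  # 300
print(f"series with clientId tag:    {with_client_id}")     # 90000
```

Three hundred series is trivial for Prometheus; ninety thousand from a single metric family starts to hurt scrape time and memory.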
Once Keycloak is running, verify the endpoint is active:
```shell
curl -s http://localhost:9000/metrics | head -20
```
The management interface listens on port 9000 by default. You should see a stream of OpenMetrics-formatted data.
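If you want to verify the scrape programmatically, for example in a deployment smoke test, a small parser over the OpenMetrics text format is enough. This is a sketch against a hand-written sample payload; real output from the management endpoint contains far more metric families.

```python
# Sample OpenMetrics payload (illustrative; real /metrics output is larger).
sample = """\
# TYPE keycloak_user_events_total counter
keycloak_user_events_total{realm="master",event="LOGIN"} 42.0
# TYPE agroal_active_count gauge
agroal_active_count{datasource="default"} 3.0
"""

def metric_families(payload: str) -> set[str]:
    """Return the set of metric names present in an OpenMetrics payload."""
    names = set()
    for line in payload.splitlines():
        if line and not line.startswith("#"):
            # Metric name ends at the label block "{" or at the first space.
            names.add(line.split("{")[0].split(" ")[0])
    return names

found = metric_families(sample)
assert {"keycloak_user_events_total", "agroal_active_count"} <= found
```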
Key Keycloak Metrics to Track
Keycloak exposes three categories of metrics: Keycloak-specific counters for authentication events, standard HTTP server metrics from the underlying Quarkus runtime, and infrastructure metrics for database connections and caches.
The following table lists the most important metrics for monitoring authentication performance:
| Metric Name | Type | Description |
|---|---|---|
| `keycloak_user_events_total` | Counter | Total user events (logins, logouts, token exchanges) segmented by realm, client, IDP, event type, and error |
| `keycloak_credentials_password_hashing_validations_total` | Counter | Password hash validation attempts with outcome labels (valid, invalid, error) and algorithm details |
| `http_server_requests_seconds_count` | Counter | Total HTTP requests processed, labeled by method, status, URI pattern, and outcome |
| `http_server_requests_seconds_sum` | Counter | Cumulative request duration in seconds across all HTTP requests |
| `http_server_requests_seconds_bucket` | Histogram | Request duration distribution across configured SLO buckets |
| `http_server_active_requests` | Gauge | Current number of in-flight HTTP requests |
| `http_server_bytes_written_sum` | Counter | Total bytes sent in HTTP responses |
| `http_server_bytes_read_sum` | Counter | Total bytes received in HTTP requests |
| `agroal_active_count` | Gauge | Database connections currently in use |
| `agroal_available_count` | Gauge | Idle database connections in the pool |
| `agroal_awaiting_count` | Gauge | Threads waiting for a database connection |
These metrics combine to give you a complete picture of authentication throughput, latency, error rates, and resource utilization. You can inspect individual tokens issued by Keycloak using the JWT Token Analyzer to verify that claims and scopes are correct.
Configuring Prometheus to Scrape Keycloak
With metrics enabled, configure Prometheus to scrape the Keycloak management endpoint. Add a scrape job to your prometheus.yml:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "keycloak"
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ["keycloak:9000"]
        labels:
          environment: "production"
          service: "identity"
    # Optional: reduce cardinality by dropping high-churn labels
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "jvm_buffer_.*"
        action: drop
```
For Kubernetes deployments using the Prometheus Operator, create a ServiceMonitor instead:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-metrics
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - keycloak
  selector:
    matchLabels:
      app: keycloak
  endpoints:
    - port: management
      path: /metrics
      interval: 15s
```
Ensure your Prometheus instance has the release: prometheus label selector configured (or whatever label your Prometheus Operator watches for) so it picks up the ServiceMonitor.
Essential PromQL Queries for Authentication KPIs
Raw metrics become useful when you transform them into KPIs. The following PromQL queries cover the authentication metrics that matter most for operations teams.
Authentication Request Rate
Track how many authentication requests Keycloak is handling per second, filtered to login and token endpoints:
```promql
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*"
}[5m]))
```
Login Success and Failure Rates
Use the keycloak_user_events_total counter to measure login outcomes. The event label distinguishes between successful logins and various failure types:
```promql
# Successful logins per minute
sum(rate(keycloak_user_events_total{event="LOGIN"}[5m])) * 60

# Failed logins per minute
sum(rate(keycloak_user_events_total{event="LOGIN_ERROR"}[5m])) * 60

# Login failure ratio
sum(rate(keycloak_user_events_total{event="LOGIN_ERROR"}[5m]))
/
(sum(rate(keycloak_user_events_total{event="LOGIN"}[5m]))
 + sum(rate(keycloak_user_events_total{event="LOGIN_ERROR"}[5m])))
```
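The failure-ratio arithmetic is simple enough to sanity-check by hand. A quick sketch with illustrative rates (events per second over a 5-minute window):

```python
# Login failure ratio, computed the same way as the PromQL above:
# failures / (successes + failures). Values are illustrative.
login_rate = 12.0        # rate of event="LOGIN"
login_error_rate = 0.5   # rate of event="LOGIN_ERROR"

failure_ratio = login_error_rate / (login_rate + login_error_rate)
print(f"failure ratio: {failure_ratio:.3f}")  # 0.5 / 12.5 = 0.040
```

Note that the denominator includes both outcomes; dividing failures by successes alone would overstate the ratio as failures climb.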
Authentication Latency (P95)
If you enabled histogram buckets with --http-metrics-slos, you can calculate percentile latency for authentication endpoints:
```promql
histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket{
    uri=~"/realms/.*/(protocol|login-actions)/.*"
  }[5m])) by (le)
)
```
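Under the hood, `histogram_quantile` finds the cumulative bucket containing the requested rank and interpolates linearly within it. A minimal Python sketch of that calculation, using the SLO bucket bounds configured earlier and illustrative cumulative counts:

```python
# Cumulative histogram buckets: (le upper bound in seconds, cumulative count).
# Bounds mirror --http-metrics-slos=250,500,1000,2500; counts are illustrative.
buckets = [
    (0.25, 900),
    (0.5, 970),
    (1.0, 990),
    (2.5, 998),
    (float("inf"), 1000),
]

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if rank <= count:
            if upper_bound == float("inf"):
                return lower_bound  # cap at the highest finite bound
            # Linear interpolation within the bucket.
            span = count - lower_count
            frac = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, count
    return lower_bound

p95 = histogram_quantile(0.95, buckets)
print(f"P95 latency: {p95:.3f}s")  # rank 950 falls in the 0.25-0.5s bucket
```

This also illustrates why bucket choice matters: the estimate can only be as precise as the bucket it lands in, so put bounds near your SLO thresholds.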
SLO Compliance: Percentage of Requests Under 250ms
This query directly measures what percentage of authentication requests meet a 250ms latency target, which aligns with Keycloak’s own recommended SLI approach:
```promql
sum(rate(http_server_requests_seconds_bucket{
  uri=~"/realms/.*/(protocol|login-actions)/.*",
  le="0.25"
}[5m]))
/
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*"
}[5m]))
```
Server Error Rate
Track the proportion of authentication requests that result in 5xx errors:
```promql
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*",
  outcome="SERVER_ERROR"
}[5m]))
/
sum(rate(http_server_requests_seconds_count{
  uri=~"/realms/.*/(protocol|login-actions)/.*"
}[5m]))
```
Database Connection Pool Saturation
A saturated connection pool is one of the most common causes of Keycloak latency spikes. Monitor the ratio of active connections to total pool capacity:
```promql
# Active connections as a percentage of total pool
agroal_active_count / (agroal_active_count + agroal_available_count)

# Threads waiting for connections (should be 0 under normal load)
agroal_awaiting_count
```
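The same saturation logic is easy to mirror in a health check. A sketch with illustrative gauge values (the 80% warning threshold is an assumption for the example, not a Keycloak recommendation):

```python
# Connection-pool saturation check mirroring the PromQL queries above.
active = 18      # agroal_active_count
available = 2    # agroal_available_count
awaiting = 0     # agroal_awaiting_count

saturation = active / (active + available)
print(f"pool saturation: {saturation:.0%}")  # 18 / 20 = 90%

# Warn before threads start queueing: any awaiting > 0 means requests
# are already blocked on the database pool.
if saturation > 0.8 or awaiting > 0:
    print("warning: connection pool under pressure")
```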
Password Hashing Performance
Password hashing is CPU-intensive and often the bottleneck in login flows. Track validation rates by outcome:
```promql
sum(rate(keycloak_credentials_password_hashing_validations_total[5m])) by (outcome)
```
Building Grafana Dashboards
The Keycloak project maintains official Grafana dashboards in the keycloak-grafana-dashboard GitHub repository. Two dashboards are available:
- **Keycloak Troubleshooting Dashboard** (`keycloak-troubleshooting-dashboard.json`) — displays the service level indicators described above, with drill-down panels for investigating latency and error spikes.
- **Keycloak Capacity Planning Dashboard** (`keycloak-capacity-planning-dashboard.json`) — tracks password validations, login flow volumes, and resource utilization to help with scaling decisions. This dashboard requires event metrics to be enabled.
To import them, download the JSON files from the repository (use the 26.2.0 branch for Keycloak 26.1-26.2, or main for 26.3 and later), then import via Grafana’s dashboard UI.
Custom Dashboard Panels
Beyond the official dashboards, you will likely want panels tailored to your specific SLOs. Here is a Grafana panel configuration for an authentication latency heatmap:
```json
{
  "title": "Authentication Latency Distribution",
  "type": "heatmap",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(increase(http_server_requests_seconds_bucket{uri=~\"/realms/.*/(protocol|login-actions)/.*\"}[1m])) by (le)",
      "format": "heatmap",
      "legendFormat": "{{le}}"
    }
  ],
  "yAxis": {
    "format": "s",
    "label": "Latency"
  },
  "options": {
    "calculate": false,
    "color": {
      "scheme": "Spectral",
      "reverse": true
    }
  }
}
```
And a stat panel for real-time SLO compliance:
```json
{
  "title": "Auth Latency SLO (target: 95% < 250ms)",
  "type": "stat",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_server_requests_seconds_bucket{uri=~\"/realms/.*/(protocol|login-actions)/.*\", le=\"0.25\"}[30d])) / sum(rate(http_server_requests_seconds_count{uri=~\"/realms/.*/(protocol|login-actions)/.*\"}[30d]))",
      "legendFormat": "SLO Compliance"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          { "color": "red", "value": 0 },
          { "color": "yellow", "value": 0.90 },
          { "color": "green", "value": 0.95 }
        ]
      }
    }
  }
}
```
Recommended Dashboard Layout
Organize your Keycloak monitoring dashboard into four rows:
- Overview — authentication request rate, active sessions gauge, SLO compliance percentage, current error rate
- Latency — P50/P95/P99 time series, latency heatmap, slowest endpoints table
- Authentication Events — login success/failure rates, login failure ratio, events broken down by realm and client
- Infrastructure — database connection pool usage, active HTTP connections, JVM heap utilization, password hashing rate
Defining Meaningful SLOs for Identity Systems
Having metrics and dashboards is only half the picture. To make them actionable, define service level objectives that reflect what your users actually experience.
A service level objective (SLO) for an identity system defines the minimum acceptable performance for authentication operations, measured over a rolling window (typically 30 days), and expressed as a percentage of requests meeting a latency or availability target.
Keycloak’s own SLO documentation suggests three SLIs as a starting point:
- Availability: Keycloak should be available 99.9% of the time within a rolling 30-day window (roughly 44 minutes of downtime per month).
- Latency: 95% of authentication-related requests should complete in under 250ms over a 30-day period.
- Error rate: Server-side errors on authentication endpoints should remain below 0.1%.
These are reasonable defaults, but adjust them based on your application’s requirements. A consumer-facing authentication flow may need tighter latency targets than an internal workforce login portal.
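The availability target is straightforward error-budget arithmetic. Over a strict 30-day window, 99.9% availability leaves about 43.2 minutes of allowed downtime; the commonly quoted "roughly 44 minutes" corresponds to an average calendar month, which is slightly longer than 30 days.

```python
# Error-budget arithmetic behind a 99.9% availability SLO
# over a rolling 30-day window.
availability_target = 0.999
window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days

downtime_budget = window_minutes * (1 - availability_target)
print(f"downtime budget: {downtime_budget:.1f} minutes per window")  # 43.2
```

Tracking spend against this budget (rather than reacting to each blip) tells you whether you can afford risky changes like version upgrades this month.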
Set up Prometheus alerting rules to notify your team before SLOs are breached:
```yaml
# alerts.yml
groups:
  - name: keycloak-slos
    rules:
      - alert: KeycloakHighLoginLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_requests_seconds_bucket{
              uri=~"/realms/.*/(protocol|login-actions)/.*"
            }[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Keycloak P95 login latency exceeds 500ms"

      - alert: KeycloakHighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{
            uri=~"/realms/.*/(protocol|login-actions)/.*",
            outcome="SERVER_ERROR"
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            uri=~"/realms/.*/(protocol|login-actions)/.*"
          }[5m])) > 0.001
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Keycloak server error rate exceeds 0.1%"

      - alert: KeycloakDBPoolExhausted
        expr: agroal_awaiting_count > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Threads are waiting for database connections"
```
Reducing the Operational Burden
Setting up Prometheus, Grafana, alerting rules, and dashboard maintenance is significant operational overhead. For every Keycloak version upgrade, you need to verify metric compatibility, update dashboard JSON, and validate that alert thresholds still make sense.
If you would rather focus on your application than on maintaining identity infrastructure monitoring, Skycloak provides built-in observability through its Insights feature. Authentication metrics, login analytics, and performance data are available out of the box with no Prometheus configuration or Grafana maintenance required. The monitoring layer is maintained alongside the Keycloak deployment itself, so metric compatibility is never a concern during upgrades. You can also forward Keycloak events to your SIEM via HTTP webhook for security-focused event streaming alongside metrics.
Next Steps
Start with the basics: enable --metrics-enabled=true on your Keycloak deployment and verify the /metrics endpoint responds. Add the Prometheus scrape configuration and import the official Grafana dashboards. Once you have baseline visibility, layer on the custom PromQL queries and alerting rules that match your SLO targets.
The queries and configurations in this guide give you a foundation, but every deployment is different. Pay attention to which metrics are most useful for your specific traffic patterns and tune your dashboards and alert thresholds accordingly.
Ready to skip the monitoring setup entirely? Explore Skycloak’s managed Keycloak hosting with built-in observability.