From Loki to ClickHouse: Scaling Our Log Analytics Infrastructure
A technical deep-dive into why we’re moving from Grafana Loki to ClickHouse for Keycloak log storage and analytics
TL;DR
We’re migrating our Keycloak logging infrastructure from Grafana Loki to ClickHouse. After analyzing 3 months of production data (~438 million logs), we found that ClickHouse better suits our needs for:
- Full-text search performance
- Long-term log retention and analytics
- Unified storage for logs + events (replacing Elasticsearch)
- Cost-effective scaling
The Current State: Loki
Our managed Keycloak platform runs across 4 regions (US, EU, AU, CA), with each region running its own Loki instance for log aggregation. Keycloak pods ship logs to Loki via Promtail, and our app queries Loki’s API to display logs in the UI.
Log Volume
Across regions, our platform processes millions of logs per day. At that scale, we started hitting Loki’s limits for our use case.
Keycloak Log Format
Keycloak uses JBoss Logging with the default pattern:
```
%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p [%c] (%t) %s%e%n
```
Which produces logs like:
```
2026-01-12 01:49:21,668 DEBUG [org.keycloak.events] (executor-thread-43) type="REFRESH_TOKEN", realmId="993329ad-...", clientId="portal", userId="c0ca2f5d-..."
```
Field breakdown:
| Specifier | Field Name | Example |
|---|---|---|
| `%d` | timestamp | `2026-01-12 01:49:21,668` |
| `%p` | level | `DEBUG`, `INFO`, `WARN`, `ERROR` |
| `%c` | category | `org.keycloak.events` |
| `%t` | thread_name | `executor-thread-43` |
| `%s%e` | message | The log content + exception |
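As a rough illustration, a line in this format can be split into those fields with a single regular expression. This is a minimal Go sketch; the `LogEntry` struct and `parseLogLine` helper are ours, not part of Keycloak:

```go
package main

import (
	"fmt"
	"regexp"
)

// Matches the default JBoss pattern:
// %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p [%c] (%t) %s%e%n
// %-5p right-pads the level to 5 chars, hence \s+ after it.
var logLine = regexp.MustCompile(
	`^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\S+)\s+\[([^\]]+)\] \(([^)]+)\) (.*)$`)

type LogEntry struct {
	Timestamp, Level, Category, ThreadName, Message string
}

// parseLogLine extracts the five pattern fields; ok is false
// for lines that don't match (e.g. multi-line stack traces).
func parseLogLine(line string) (e LogEntry, ok bool) {
	m := logLine.FindStringSubmatch(line)
	if m == nil {
		return LogEntry{}, false
	}
	return LogEntry{m[1], m[2], m[3], m[4], m[5]}, true
}

func main() {
	e, ok := parseLogLine(`2026-01-12 01:49:21,668 DEBUG [org.keycloak.events] (executor-thread-43) type="REFRESH_TOKEN"`)
	fmt.Println(ok, e.Level, e.Category, e.ThreadName)
}
```

Continuation lines of stack traces won't match the pattern, which is exactly how a shipper can tell them apart and fold them into the previous entry's message.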
Why Loki Falls Short for Our Use Case
1. Query Time Limits
Loki caps query ranges at 30 days (the `max_query_length` limit). For compliance and debugging, customers often need to search logs from 60-90 days ago. We had to implement chunked queries:
```go
// We have to query in 30-day chunks and stitch the results back together
for _, period := range []string{"Oct-Nov", "Nov-Dec", "Dec-Jan"} {
	results = append(results, queryLokiChunk(period)...)
}
```
2. Result Set Limits
Loki caps results at 5,000 entries per query (the default `max_entries_limit_per_query`). For customers with high-traffic Keycloak instances, this pushes pagination complexity onto us:
```typescript
// Frontend has to manage cursor-based pagination
const [oldestTimestamp, setOldestTimestamp] = useState<string>();
const [loadedCount, setLoadedCount] = useState(0);
const MAX_LOGS = 50000; // Our self-imposed limit to prevent memory issues
```
3. Full-Text Search Performance
Analyzing 3 months of user search patterns revealed:
| Search Pattern | % of Searches | Example |
|---|---|---|
| Error keywords | 20% | error, null, connection reset |
| Usernames/IDs | 40% | jgrigg, executor-thread-31 |
| UUIDs | 4% | 7fe07622-5ceb-4217-847f-a6dce3a463ef |
| Email addresses | 3% | [email protected] |
| Java categories | 10% | [org.keycloak.theme.DefaultThemeManager] |
| IP addresses | 2% | 165.228.75.218 |
Loki’s LogQL isn’t optimized for these mixed search patterns. A regex filter like `|~ "[email protected]"` has to scan every chunk in the queried range, so performance degrades linearly with log volume.
4. No Native Analytics
We want to answer questions like:
- “How many LOGIN_ERROR events per realm this week?”
- “Which clients generate the most token refresh requests?”
- “What’s the P95 latency trend for authentication?”
Loki requires exporting to another system for analytics. ClickHouse handles both in one place.
Cardinality Analysis: Why ClickHouse Fits
Before designing our ClickHouse schema, we analyzed field cardinality across 10,000 sampled logs:
Log Levels (Cardinality: 5)
```
DEBUG  47%
INFO   43%
WARN    9%
FINE    0.2%
ERROR   0.1%
```
Verdict: Perfect for LowCardinality(String)
Categories (Cardinality: ~31)
```
org.keycloak.events                           45%
io.quarkus.micrometer.runtime.binder.vertx.*  23%
org.hibernate.query.sqm.sql.*                  9%
org.keycloak.protocol.oidc.utils.AcrUtils      7%
... (27 more)
```
Verdict: Low cardinality, perfect for LowCardinality(String)
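The tallies above come from a straightforward distinct-value count over the sample. A sketch of the approach (`fieldCardinality` is our own helper, not a library function):

```go
package main

import "fmt"

// fieldCardinality counts distinct values in a sampled column and
// returns each value's share of the sample.
func fieldCardinality(values []string) (int, map[string]float64) {
	counts := map[string]int{}
	for _, v := range values {
		counts[v]++
	}
	shares := make(map[string]float64, len(counts))
	for v, n := range counts {
		shares[v] = float64(n) / float64(len(values))
	}
	return len(counts), shares
}

func main() {
	card, shares := fieldCardinality([]string{"DEBUG", "DEBUG", "INFO", "WARN"})
	fmt.Println(card, shares["DEBUG"]) // 3 distinct levels; DEBUG is half the sample
}
```

Running this over each candidate column of the 10,000-log sample is what tells us which fields are safe to declare as `LowCardinality(String)`.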
Thread Names (Raw vs Normalized)
Raw thread names have HIGH cardinality due to numeric worker IDs:
executor-thread-1, executor-thread-2, …, executor-thread-999
But normalized, they collapse to just 6 unique values:
```
executor-thread-N          55%
vert.x-eventloop-thread-N  26%
Timer-N                    17%
Periodic Recovery           2%
Transaction Expired Entry   0.1%
Connection evictor          0.02%
```
Normalization regex:
```sql
replaceRegexpAll(thread_name, '-\\d+$', '-N')
```

(Note the doubled backslash: ClickHouse string literals need `\\d` to pass `\d` through to the regex engine.)
Verdict: Store raw for debugging, add materialized normalized column for LowCardinality
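The same normalization can be mirrored in application code, which is handy for validating what the materialized column will produce before logs ever reach ClickHouse. A sketch; `normalizeThreadName` is our own helper:

```go
package main

import (
	"fmt"
	"regexp"
)

// Mirrors the ClickHouse expression
// replaceRegexpAll(thread_name, '-\d+$', '-N'):
// strip a trailing numeric worker ID and replace it with -N.
var workerID = regexp.MustCompile(`-\d+$`)

func normalizeThreadName(name string) string {
	return workerID.ReplaceAllString(name, "-N")
}

func main() {
	fmt.Println(normalizeThreadName("executor-thread-43"))
	fmt.Println(normalizeThreadName("Periodic Recovery")) // no numeric suffix: unchanged
}
```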
The ClickHouse Schema
```sql
CREATE TABLE keycloak_logs (
    -- Timestamp with millisecond precision
    timestamp DateTime64(3),

    -- Identifiers
    cluster_id LowCardinality(String),

    -- Log structure (all low cardinality)
    level LowCardinality(String),
    category LowCardinality(String),
    thread_name String,
    thread_name_base LowCardinality(String) MATERIALIZED
        replaceRegexpAll(thread_name, '-\\d+$', '-N'),

    -- Content
    message String,

    -- Parsed from org.keycloak.events messages
    event_type LowCardinality(String),
    realm_id String,
    realm_name LowCardinality(String),
    client_id LowCardinality(String),
    user_id String,
    session_id String,
    ip_address String,

    -- Indexing
    INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4,
    INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 4,
    INDEX idx_ip ip_address TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (cluster_id, level, timestamp)
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;
```
Key Design Decisions
1. LowCardinality for enum-like fields
ClickHouse’s LowCardinality type uses dictionary encoding, dramatically reducing storage and speeding up filtering for fields with < 10,000 unique values.
2. Bloom filter indexes for high-cardinality searches
For user_id, ip_address, and message content, bloom filters enable fast “needle in haystack” queries without full scans.
3. Materialized thread_name_base
Computed at insert time, this gives us the best of both worlds: raw thread names for debugging, normalized names for analytics.
4. 90-day TTL
Automatic data expiration handles retention without manual cleanup jobs.
5. Partition by month
Enables efficient partition pruning for time-range queries and simplified data lifecycle management.
Search Pattern Optimization
Based on our analysis of real user searches, we’re optimizing for these patterns:
1. Exact Match (UUIDs, IPs, Emails)
```sql
-- ClickHouse: bloom filter index skips granules with no match
SELECT * FROM keycloak_logs
WHERE user_id = '7fe07622-5ceb-4217-847f-a6dce3a463ef';
```

```logql
# Loki: full regex scan over every chunk in range
{container="keycloak"} |~ "7fe07622-5ceb-4217-847f-a6dce3a463ef"
```
2. Prefix Search (Progressive Typing)
Users often type incrementally: st → struth → struthers
```sql
-- ClickHouse: anchored LIKE is a fast columnar substring scan
-- (the token bloom filter only helps once 'struth' is a complete token)
SELECT * FROM keycloak_logs
WHERE message LIKE 'struth%';
```
3. Full-Text Search (Error Messages)
```sql
-- ClickHouse: token bloom filter
SELECT * FROM keycloak_logs
WHERE hasToken(message, 'connection') AND hasToken(message, 'reset');
```
4. Category Filtering
```sql
-- ClickHouse: dictionary-encoded, extremely fast
SELECT * FROM keycloak_logs
WHERE category = 'org.keycloak.events'
  AND level = 'ERROR'
  AND timestamp > now() - INTERVAL 1 DAY;
```
The Unified Vision: Logs + Events in ClickHouse
Currently, our architecture looks like:
```
Keycloak Pods   → Loki (logs)
Keycloak Events → Kafka → Elasticsearch (events)
```
We’re consolidating to:
```
Keycloak Pods   → Vector → ClickHouse (logs)
Keycloak Events → Kafka  → ClickHouse (events)
```
Why Unify?
1. Correlated queries
```sql
-- Find all logs around a LOGIN_ERROR event
-- (equi-join on cluster_id; the time window goes in WHERE,
-- since ClickHouse joins want equality conditions in ON)
SELECT l.*
FROM keycloak_logs AS l
JOIN keycloak_events AS e ON l.cluster_id = e.cluster_id
WHERE e.event_type = 'LOGIN_ERROR'
  AND e.user_id = 'some-user-id'
  AND l.timestamp BETWEEN e.timestamp - INTERVAL 5 SECOND
                      AND e.timestamp + INTERVAL 5 SECOND;
```
2. Single query language
No more context-switching between LogQL, Elasticsearch DSL, and SQL. Just SQL everywhere.
3. Cost reduction
Running Loki + Elasticsearch + their dependencies (S3 for Loki chunks, EBS for ES) costs more than a single ClickHouse cluster with S3-backed storage.
4. Operational simplicity
One system to monitor, scale, and maintain instead of two.
Migration Strategy
Phase 1: Dual-Write (Current)
```
Keycloak → Vector → Loki (existing)
                  → ClickHouse (new)
```
Both systems receive logs. We validate ClickHouse data quality and query performance.
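One cheap validation during dual-write is comparing per-day log counts pulled from each backend. A sketch of the comparison step; `countDrift` and the tolerance threshold are our own, and fetching the counts from Loki and ClickHouse is deliberately left out:

```go
package main

import (
	"fmt"
	"math"
)

// countDrift reports days whose per-day log counts in the two stores
// diverge by more than tolerance, expressed as a fraction of the Loki count.
func countDrift(loki, clickhouse map[string]int, tolerance float64) []string {
	var drifted []string
	for day, lc := range loki {
		cc := clickhouse[day]
		if lc == 0 {
			if cc != 0 {
				drifted = append(drifted, day)
			}
			continue
		}
		if math.Abs(float64(cc-lc))/float64(lc) > tolerance {
			drifted = append(drifted, day)
		}
	}
	return drifted
}

func main() {
	loki := map[string]int{"2026-01-10": 1_000_000, "2026-01-11": 1_200_000}
	ch := map[string]int{"2026-01-10": 999_850, "2026-01-11": 900_000}
	fmt.Println(countDrift(loki, ch, 0.01)) // only 2026-01-11 exceeds the 1% tolerance
}
```

A small, steady drift is expected (different ingestion lag, different rejection rules); large or growing drift on a day flags a pipeline bug before we commit to the read migration.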
Phase 2: Read Migration
Update the app to query ClickHouse for logs. Loki remains as fallback.
Phase 3: Deprecate Loki
Once confident, stop writing to Loki and decommission.
Phase 4: Events Migration
Move Keycloak events from Elasticsearch to ClickHouse using the same approach.
Performance Expectations
Based on ClickHouse benchmarks with similar workloads:
| Query Type | Loki | ClickHouse (Expected) |
|---|---|---|
| Time range filter (1 day) | 2-5s | <100ms |
| Full-text search | 10-30s | <500ms |
| Aggregation (count by level) | N/A | <200ms |
| UUID exact match | 5-15s | <50ms |
| 90-day query range | Not supported | <2s |
Conclusion
Loki served us well for real-time log streaming and short-term debugging. But as our platform scales to hundreds of Keycloak clusters generating millions of logs daily, we need:
- Longer retention without query limitations
- Faster full-text search for customer support
- Analytics capabilities for product insights
- Unified storage for logs and events
ClickHouse checks all these boxes while reducing operational complexity and cost.
Ready to simplify your authentication?
Deploy production-ready Keycloak in minutes. Unlimited users, flat pricing, no SSO tax.