From Loki to ClickHouse: Scaling Our Log Analytics Infrastructure
A technical deep-dive into why we’re moving from Grafana Loki to ClickHouse for Keycloak log storage and analytics
TL;DR
We’re migrating our Keycloak logging infrastructure from Grafana Loki to ClickHouse. After analyzing 3 months of production data (~438 million logs), we found that ClickHouse better suits our needs for:
- Full-text search performance
- Long-term log retention and analytics
- Unified storage for logs + events (replacing Elasticsearch)
- Cost-effective scaling
The Current State: Loki
Our managed Keycloak platform runs across 4 regions (US, EU, AU, CA), with each region running its own Loki instance for log aggregation. Keycloak pods ship logs to Loki via Promtail, and our app queries Loki’s API to display logs in the UI.
Log Volume
Across regions, our platform processes millions of logs per day. At that scale, we started hitting Loki’s limits for our use case.
Keycloak Log Format
Keycloak uses JBoss Logging with the default pattern:
```
%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p [%c] (%t) %s%e%n
```
Which produces logs like:
```
2026-01-12 01:49:21,668 DEBUG [org.keycloak.events] (executor-thread-43) type="REFRESH_TOKEN", realmId="993329ad-...", clientId="portal", userId="c0ca2f5d-..."
```
Field breakdown:
| Specifier | Field Name | Example |
|---|---|---|
| `%d` | timestamp | `2026-01-12 01:49:21,668` |
| `%p` | level | `DEBUG`, `INFO`, `WARN`, `ERROR` |
| `%c` | category | `org.keycloak.events` |
| `%t` | thread_name | `executor-thread-43` |
| `%s%e` | message | The log content + exception |
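As a rough illustration, a line in this format can be split into those fields with a single regular expression. This is a minimal Go sketch; the `LogEntry` struct and `parseLogLine` helper are ours, not part of Keycloak:

```go
package main

import (
	"fmt"
	"regexp"
)

// Matches the default JBoss pattern:
// %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p [%c] (%t) %s%e%n
// %-5p right-pads the level to 5 chars, hence \s+ after it.
var logLine = regexp.MustCompile(
	`^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\S+)\s+\[([^\]]+)\] \(([^)]+)\) (.*)$`)

type LogEntry struct {
	Timestamp, Level, Category, ThreadName, Message string
}

// parseLogLine extracts the five pattern fields; ok is false
// for lines that don't match (e.g. multi-line stack traces).
func parseLogLine(line string) (e LogEntry, ok bool) {
	m := logLine.FindStringSubmatch(line)
	if m == nil {
		return LogEntry{}, false
	}
	return LogEntry{m[1], m[2], m[3], m[4], m[5]}, true
}

func main() {
	e, ok := parseLogLine(`2026-01-12 01:49:21,668 DEBUG [org.keycloak.events] (executor-thread-43) type="REFRESH_TOKEN"`)
	fmt.Println(ok, e.Level, e.Category, e.ThreadName)
}
```

Continuation lines of stack traces won't match the pattern, which is exactly how a shipper can tell them apart and fold them into the previous entry's message.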
Why Loki Falls Short for Our Use Case
1. Query Time Limits
Loki caps query ranges at 30 days (the `max_query_length` limit). For compliance and debugging, customers often need to search logs from 60-90 days ago. We had to implement chunked queries:
```go
// We have to query in 30-day chunks and stitch the results back together
for _, period := range []string{"Oct-Nov", "Nov-Dec", "Dec-Jan"} {
	results = append(results, queryLokiChunk(period)...)
}
```
2. Result Set Limits
Loki caps results at 5,000 entries per query (the default `max_entries_limit_per_query`). For customers with high-traffic Keycloak instances, this pushes pagination complexity onto us:
```typescript
// Frontend has to manage cursor-based pagination
const [oldestTimestamp, setOldestTimestamp] = useState<string>();
const [loadedCount, setLoadedCount] = useState(0);
const MAX_LOGS = 50000; // Our self-imposed limit to prevent memory issues
```
3. Full-Text Search Performance
Analyzing 3 months of user search patterns revealed:
| Search Pattern | % of Searches | Example |
|---|---|---|
| Error keywords | 20% | error, null, connection reset |
| Usernames/IDs | 40% | jgrigg, executor-thread-31 |
| UUIDs | 4% | 7fe07622-5ceb-4217-847f-a6dce3a463ef |
| Email addresses | 3% | [email protected] |
| Java categories | 10% | [org.keycloak.theme.DefaultThemeManager] |
| IP addresses | 2% | 165.228.75.218 |
Loki’s LogQL isn’t optimized for these mixed search patterns. A regex filter like `|~ "[email protected]"` has to scan every chunk in the queried range, so performance degrades linearly with log volume.
4. No Native Analytics
We want to answer questions like:
- “How many LOGIN_ERROR events per realm this week?”
- “Which clients generate the most token refresh requests?”
- “What’s the P95 latency trend for authentication?”
Loki requires exporting to another system for analytics. ClickHouse handles both in one place.
Cardinality Analysis: Why ClickHouse Fits
Before designing our ClickHouse schema, we analyzed field cardinality across 10,000 sampled logs:
Log Levels (Cardinality: 5)
```
DEBUG  47%
INFO   43%
WARN    9%
FINE    0.2%
ERROR   0.1%
```
Verdict: Perfect for LowCardinality(String)
Categories (Cardinality: ~31)
```
org.keycloak.events                           45%
io.quarkus.micrometer.runtime.binder.vertx.*  23%
org.hibernate.query.sqm.sql.*                  9%
org.keycloak.protocol.oidc.utils.AcrUtils      7%
... (27 more)
```
Verdict: Low cardinality, perfect for LowCardinality(String)
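The tallies above come from a straightforward distinct-value count over the sample. A sketch of the approach (`fieldCardinality` is our own helper, not a library function):

```go
package main

import "fmt"

// fieldCardinality counts distinct values in a sampled column and
// returns each value's share of the sample.
func fieldCardinality(values []string) (int, map[string]float64) {
	counts := map[string]int{}
	for _, v := range values {
		counts[v]++
	}
	shares := make(map[string]float64, len(counts))
	for v, n := range counts {
		shares[v] = float64(n) / float64(len(values))
	}
	return len(counts), shares
}

func main() {
	card, shares := fieldCardinality([]string{"DEBUG", "DEBUG", "INFO", "WARN"})
	fmt.Println(card, shares["DEBUG"]) // 3 distinct levels; DEBUG is half the sample
}
```

Running this over each candidate column of the 10,000-log sample is what tells us which fields are safe to declare as `LowCardinality(String)`.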
Thread Names (Raw vs Normalized)
Raw thread names have HIGH cardinality due to numeric worker IDs:
executor-thread-1, executor-thread-2, …, executor-thread-999
But normalized, they collapse to just 6 unique values:
```
executor-thread-N          55%
vert.x-eventloop-thread-N  26%
Timer-N                    17%
Periodic Recovery           2%
Transaction Expired Entry   0.1%
Connection evictor          0.02%
```
Normalization regex:
```sql
replaceRegexpAll(thread_name, '-\\d+$', '-N')
```

(Note the doubled backslash: ClickHouse string literals need `\\d` to pass `\d` through to the regex engine.)
Verdict: Store raw for debugging, add materialized normalized column for LowCardinality
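The same normalization can be mirrored in application code, which is handy for validating what the materialized column will produce before logs ever reach ClickHouse. A sketch; `normalizeThreadName` is our own helper:

```go
package main

import (
	"fmt"
	"regexp"
)

// Mirrors the ClickHouse expression
// replaceRegexpAll(thread_name, '-\d+$', '-N'):
// strip a trailing numeric worker ID and replace it with -N.
var workerID = regexp.MustCompile(`-\d+$`)

func normalizeThreadName(name string) string {
	return workerID.ReplaceAllString(name, "-N")
}

func main() {
	fmt.Println(normalizeThreadName("executor-thread-43"))
	fmt.Println(normalizeThreadName("Periodic Recovery")) // no numeric suffix: unchanged
}
```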
The ClickHouse Schema
```sql
CREATE TABLE keycloak_logs (
    -- Timestamp with millisecond precision
    timestamp DateTime64(3),

    -- Identifiers
    cluster_id LowCardinality(String),

    -- Log structure (all low cardinality)
    level LowCardinality(String),
    category LowCardinality(String),
    thread_name String,
    thread_name_base LowCardinality(String) MATERIALIZED
        replaceRegexpAll(thread_name, '-\\d+$', '-N'),

    -- Content
    message String,

    -- Parsed from org.keycloak.events messages
    event_type LowCardinality(String),
    realm_id String,
    realm_name LowCardinality(String),
    client_id LowCardinality(String),
    user_id String,
    session_id String,
    ip_address String,

    -- Indexing
    INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4,
    INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 4,
    INDEX idx_ip ip_address TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (cluster_id, level, timestamp)
TTL timestamp + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;
```
Key Design Decisions
1. LowCardinality for enum-like fields
ClickHouse’s LowCardinality type uses dictionary encoding, dramatically reducing storage and speeding up filtering for fields with < 10,000 unique values.
2. Bloom filter indexes for high-cardinality searches
For user_id, ip_address, and message content, bloom filters enable fast “needle in haystack” queries without full scans.
3. Materialized thread_name_base
Computed at insert time, this gives us the best of both worlds: raw thread names for debugging, normalized names for analytics.
4. 90-day TTL
Automatic data expiration handles retention without manual cleanup jobs.
5. Partition by month
Enables efficient partition pruning for time-range queries and simplified data lifecycle management.
Search Pattern Optimization
Based on our analysis of real user searches, we’re optimizing for these patterns:
1. Exact Match (UUIDs, IPs, Emails)
```sql
-- ClickHouse: bloom filter index skips granules with no match
SELECT * FROM keycloak_logs
WHERE user_id = '7fe07622-5ceb-4217-847f-a6dce3a463ef';
```

```logql
# Loki: full regex scan over every chunk in range
{container="keycloak"} |~ "7fe07622-5ceb-4217-847f-a6dce3a463ef"
```
2. Prefix Search (Progressive Typing)
Users often type incrementally: st → struth → struthers
```sql
-- ClickHouse: anchored LIKE is a fast columnar substring scan
-- (the token bloom filter only helps once 'struth' is a complete token)
SELECT * FROM keycloak_logs
WHERE message LIKE 'struth%';
```
3. Full-Text Search (Error Messages)
```sql
-- ClickHouse: token bloom filter
SELECT * FROM keycloak_logs
WHERE hasToken(message, 'connection') AND hasToken(message, 'reset');
```
4. Category Filtering
```sql
-- ClickHouse: dictionary-encoded, extremely fast
SELECT * FROM keycloak_logs
WHERE category = 'org.keycloak.events'
  AND level = 'ERROR'
  AND timestamp > now() - INTERVAL 1 DAY;
```
The Unified Vision: Logs + Events in ClickHouse
Currently, our architecture looks like:
```
Keycloak Pods   → Loki (logs)
Keycloak Events → Kafka → Elasticsearch (events)
```
We’re consolidating to:
```
Keycloak Pods   → Vector → ClickHouse (logs)
Keycloak Events → Kafka  → ClickHouse (events)
```
Why Unify?
1. Correlated queries
```sql
-- Find all logs around a LOGIN_ERROR event
-- (equi-join on cluster_id; the time window goes in WHERE,
-- since ClickHouse joins want equality conditions in ON)
SELECT l.*
FROM keycloak_logs AS l
JOIN keycloak_events AS e ON l.cluster_id = e.cluster_id
WHERE e.event_type = 'LOGIN_ERROR'
  AND e.user_id = 'some-user-id'
  AND l.timestamp BETWEEN e.timestamp - INTERVAL 5 SECOND
                      AND e.timestamp + INTERVAL 5 SECOND;
```
2. Single query language
No more context-switching between LogQL, Elasticsearch DSL, and SQL. Just SQL everywhere.
3. Cost reduction
Running Loki + Elasticsearch + their dependencies (S3 for Loki chunks, EBS for ES) costs more than a single ClickHouse cluster with S3-backed storage.
4. Operational simplicity
One system to monitor, scale, and maintain instead of two.
Migration Strategy
Phase 1: Dual-Write (Current)
```
Keycloak → Vector → Loki (existing)
                  → ClickHouse (new)
```
Both systems receive logs. We validate ClickHouse data quality and query performance.
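One cheap validation during dual-write is comparing per-day log counts pulled from each backend. A sketch of the comparison step; `countDrift` and the tolerance threshold are our own, and fetching the counts from Loki and ClickHouse is deliberately left out:

```go
package main

import (
	"fmt"
	"math"
)

// countDrift reports days whose per-day log counts in the two stores
// diverge by more than tolerance, expressed as a fraction of the Loki count.
func countDrift(loki, clickhouse map[string]int, tolerance float64) []string {
	var drifted []string
	for day, lc := range loki {
		cc := clickhouse[day]
		if lc == 0 {
			if cc != 0 {
				drifted = append(drifted, day)
			}
			continue
		}
		if math.Abs(float64(cc-lc))/float64(lc) > tolerance {
			drifted = append(drifted, day)
		}
	}
	return drifted
}

func main() {
	loki := map[string]int{"2026-01-10": 1_000_000, "2026-01-11": 1_200_000}
	ch := map[string]int{"2026-01-10": 999_850, "2026-01-11": 900_000}
	fmt.Println(countDrift(loki, ch, 0.01)) // only 2026-01-11 exceeds the 1% tolerance
}
```

A small, steady drift is expected (different ingestion lag, different rejection rules); large or growing drift on a day flags a pipeline bug before we commit to the read migration.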
Phase 2: Read Migration
Update the app to query ClickHouse for logs. Loki remains as fallback.
Phase 3: Deprecate Loki
Once confident, stop writing to Loki and decommission.
Phase 4: Events Migration
Move Keycloak events from Elasticsearch to ClickHouse using the same approach.
Performance Expectations
Based on ClickHouse benchmarks with similar workloads:
| Query Type | Loki | ClickHouse (Expected) |
|---|---|---|
| Time range filter (1 day) | 2-5s | <100ms |
| Full-text search | 10-30s | <500ms |
| Aggregation (count by level) | N/A | <200ms |
| UUID exact match | 5-15s | <50ms |
| 90-day query range | Not supported | <2s |
Conclusion
Loki served us well for real-time log streaming and short-term debugging. But as our platform scales to hundreds of Keycloak clusters generating millions of logs daily, we need:
- Longer retention without query limitations
- Faster full-text search for customer support
- Analytics capabilities for product insights
- Unified storage for logs and events
ClickHouse checks all these boxes while reducing operational complexity and cost.
Ready to simplify your authentication?
Deploy production-ready Keycloak in minutes. Unlimited users, flat pricing, no SSO tax.