
Best Elasticsearch alternatives for log analytics with high-cardinality fields (cheaper at scale)
Most teams only start looking for Elasticsearch alternatives when their log analytics bill explodes and high-cardinality fields (like user_id, trace_id, session_id, feature_flag) start killing both performance and budget. At that point, you don’t just need “something cheaper”—you need a storage and query engine that actually wants this workload: billions of rows, wide events, and investigative queries that slice by dimensions you didn’t pre-aggregate.
Quick Answer: The best Elasticsearch alternatives for high-cardinality log analytics at scale are columnar OLAP databases—especially ClickHouse—plus a few specialized time-series and observability stacks. ClickHouse stands out because it can store petabytes of logs, handle extreme cardinality, and still deliver millisecond queries at a fraction of Elasticsearch’s cost, thanks to column-oriented storage, compression, and vectorized execution.
Why This Matters
Log analytics with high-cardinality fields is where Elasticsearch often becomes both slow and expensive. Every new distinct value (every user, host, trace, or feature flag) enlarges the inverted index's term dictionary, and aggregations over those fields need more heap as the number of distinct terms grows. As ingest grows into billions of events per day, you start to see:
- Timeouts on ad hoc investigations
- Dashboards that take multiple seconds to render
- Clusters that scale “sideways” with ever more nodes just to stay alive
- Surprise invoices that rival your production infra spend
Switching to the right alternative isn’t just a cost optimization; it changes what’s feasible for your engineering and SRE teams. With a column-oriented engine like ClickHouse, you can:
- Keep raw logs longer without blowing up storage
- Run millisecond-range exploratory queries across billions of rows
- Power observability, security analytics, and product analytics from the same real-time backend
Key Benefits:
- Cheaper at scale: Columnar storage and compression dramatically reduce storage footprint and IO, especially for semi-structured logs with many sparse fields.
- High-cardinality friendly: Engines like ClickHouse are designed to scan and aggregate billions of rows quickly, even when you filter/group by highly unique dimensions.
- Faster investigations: Vectorized execution and smart schema design give you sub-second queries on workloads that would time out or require pre-aggregation in Elasticsearch.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| High-cardinality fields | Columns with a very large number of distinct values (e.g., user_id, trace_id, ip, url, feature_flag) | These fields are painful in index-heavy systems; they drive memory usage, index size, and query latency in Elasticsearch. |
| Columnar OLAP storage | Databases that store data by column, not by row, and are optimized for analytical scans and aggregations | Perfect for log analytics at scale: you only read the columns you need, get best-in-class compression, and can aggregate billions of rows in real time. |
| ClickHouse for observability | Using ClickHouse (and ClickStack) as the backend for logs, metrics, and traces | Enables petabyte-scale observability with millisecond queries and lower storage cost compared to index-heavy search engines. |
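
To check the compression claim against your own data, a query along these lines reports per-column sizes. It is a sketch that assumes a table named `logs` in the current database (the table name is an assumption; substitute your own). `system.columns` and its `data_compressed_bytes` and `data_uncompressed_bytes` columns are standard ClickHouse system metadata.

```sql
-- Per-column on-disk size and compression ratio for an assumed `logs` table.
SELECT
    name,
    type,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'logs'
ORDER BY data_compressed_bytes DESC;
```

Columns with many repeated values (log level, service name) typically show the highest ratios, while unique identifiers such as trace IDs compress less.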
How It Works (Step-by-Step)
Below is a practical view of how to move from Elasticsearch pain to an alternative that’s friendlier to high-cardinality log analytics, using ClickHouse as the reference design.
- Clarify your log analytics workload
  - Inventory what you actually do with your logs:
    - Real-time dashboards (latency, errors, throughput)
    - Ad hoc investigations (“show me all requests for user X in the last 24h”)
    - Retention requirements (30/90/365 days? Legal/audit?)
    - Cardinality hotspots (`user_id`, `session_id`, `trace_id`, `k8s_pod`, `feature_flag`)
  - Look at your current bottlenecks:
    - Indices too large?
    - Shards and heap under pressure?
    - Aggregations timing out with `503`/`504`?
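
  If a sample of your logs is already loaded into ClickHouse, the cardinality hotspots can be measured rather than guessed. This sketch assumes a `logs` table with `ts`, `user_id`, `session_id`, and `trace_id` columns, matching the example schema used in this article. `uniq` returns an approximate distinct count, which is usually enough for sizing.

  ```sql
  -- Approximate distinct counts for the suspected hotspot fields over one day.
  SELECT
      count()          AS events,
      uniq(user_id)    AS distinct_users,
      uniq(session_id) AS distinct_sessions,
      uniq(trace_id)   AS distinct_traces
  FROM logs
  WHERE ts >= now() - INTERVAL 1 DAY;
  ```

  Use `uniqExact` instead of `uniq` if you need exact numbers and can afford the extra memory.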
- Match the right alternative to your constraints

  Here are the main classes of Elasticsearch alternatives for high-cardinality logs:

  - ClickHouse + ClickStack (recommended for most teams)
    - A column-oriented OLAP database plus an open-source observability stack (ClickStack) that stores and queries logs, metrics, and traces at scale.
    - Best when you need:
      - Millisecond queries over billions of log events
      - Cheaper long-term storage through compression
      - A single backend for logs/metrics/traces and even ML/GenAI workloads
    - ClickHouse is “the leading database for AI,” but the same primitives that make it excellent for vector search and agentic systems (fast scans, aggregations, and joins) make it ideal for observability data.
  - Other columnar warehouses (e.g., BigQuery, Snowflake)
    - Good for batch analytics or compliance queries over log archives.
    - Weak for real-time, interactive debug sessions because cold-start latency and cost-per-query can be high.
  - Time-series databases (e.g., M3, VictoriaMetrics, TimescaleDB)
    - Excellent for metrics with relatively low cardinality and predictable schemas.
    - Less ideal for raw logs with dozens or hundreds of dynamic fields and extremely high cardinality.
  - Log-specific platforms (e.g., Loki, Vector + S3 + query layer)
    - Often cheaper than Elasticsearch but may trade off query flexibility or speed for very large, unindexed blobs.

  For high-cardinality logs plus investigative queries, the most consistent pattern I’ve seen in production is: ClickHouse as the primary log analytics backend and optionally a long-term archive in cheap object storage.
- Implement a ClickHouse-based log analytics stack

  Here’s a minimal, realistic path:

  - Ingestion pipeline
    - Use existing agents (Fluent Bit, Vector, OpenTelemetry Collector) to ship logs.
    - Batch events to at least 1,000 rows per insert, and preferably 10,000–100,000 or more. This keeps MergeTree healthy and avoids “Too many parts” errors from tiny inserts.
    - In ClickHouse Cloud, you can ingest via HTTP, Kafka, or integrations; on self-managed, the same applies with more control over topology.
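
    When client-side batching is hard to enforce (many small producers, for example), ClickHouse's asynchronous inserts are one alternative: the server buffers small inserts and writes them out as larger parts. A minimal sketch, assuming the example `logs` table from this article; the inserted values are purely illustrative.

    ```sql
    -- Ask the server to buffer this insert and flush it together with others.
    -- wait_for_async_insert = 1 makes the client wait until the buffer has been
    -- flushed, so a successful response means the data was written.
    INSERT INTO logs (ts, level, message, service, host, user_id, trace_id, session_id)
    SETTINGS async_insert = 1, wait_for_async_insert = 1
    VALUES (now64(3), 'error', 'upstream timeout', 'checkout', 'web-42', '123456', 'trace-abc', 'sess-789');
    ```

    Shipper-side batching remains the simpler default; asynchronous inserts mainly help when you cannot control how producers flush.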
  - Schema design for high cardinality
    - Store logs in a wide columnar table; you don’t need a separate index structure like Elasticsearch.
    - Example structure:

      ```sql
      CREATE TABLE logs
      (
          ts         DateTime64(3, 'UTC'),
          level      LowCardinality(String),
          message    String,
          service    LowCardinality(String),
          host       LowCardinality(String),
          user_id    String,
          trace_id   String,
          session_id String,
          attrs      Map(String, String) -- flexible extra labels
      )
      ENGINE = MergeTree
      PARTITION BY toDate(ts)
      ORDER BY (service, ts);
      ```

    - Note:
      - `LowCardinality` is ideal for fields with many repeats (log level, service name).
      - Truly high-cardinality fields like `user_id` or `trace_id` are fine as plain `String`; ClickHouse will scan and filter efficiently via columnar storage.
  - Retention & cost optimization
    - Use partitioning for lifecycle management, not micro-optimization:
      - Daily partitions (`toDate(ts)`) are usually enough.
      - Avoid partitioning by high-cardinality dimensions; you’ll create too many partitions and parts.
    - Implement tiered retention:
      - Keep 7–30 days in fast storage.
      - Offload older partitions to cheaper disks or object storage (Cloud) as needed.
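
    Retention can be expressed directly in SQL. The sketch below assumes the example `logs` table with daily partitions; the 90-day window and the partition date are illustrative values, not recommendations.

    ```sql
    -- Delete rows automatically once they are older than 90 days.
    -- toDateTime() converts the DateTime64 column to a plain DateTime for the TTL expression.
    ALTER TABLE logs MODIFY TTL toDateTime(ts) + INTERVAL 90 DAY;

    -- Or drop a single daily partition explicitly; with PARTITION BY toDate(ts)
    -- a partition is addressed by its date.
    ALTER TABLE logs DROP PARTITION '2024-01-01';
    ```

    Moving old data to cheaper disks instead of deleting it also uses TTL rules, but it depends on a storage policy configured on the server, which is not shown here.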
  - Query patterns
    - Debugging a single user:

      ```sql
      SELECT *
      FROM logs
      WHERE user_id = '123456'
        AND ts >= now() - INTERVAL 24 HOUR
      ORDER BY ts
      LIMIT 1000;
      ```

    - High-cardinality aggregation by `trace_id`:

      ```sql
      SELECT
          trace_id,
          count(*) AS events
      FROM logs
      WHERE ts >= now() - INTERVAL 1 HOUR
      GROUP BY trace_id
      ORDER BY events DESC
      LIMIT 100;
      ```

    - Because ClickHouse is columnar and vectorized, these queries stay fast even as row counts climb into billions.
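
If single-user lookups end up scanning more data than you would like (the sort key here is `(service, ts)`, so `user_id` is not part of it), an optional data-skipping index is one thing to try. This is a sketch under assumptions: the index name `idx_user_id`, the false-positive rate, and the granularity are illustrative starting values, and whether the index helps depends on how the data is distributed.

```sql
-- Add a Bloom-filter skipping index so that granules which cannot contain the
-- requested user_id are skipped. 0.01 is the target false-positive rate.
ALTER TABLE logs ADD INDEX idx_user_id user_id TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for parts that already exist; newly written parts get it automatically.
ALTER TABLE logs MATERIALIZE INDEX idx_user_id;

-- Inspect which indexes the single-user query actually uses.
EXPLAIN indexes = 1
SELECT *
FROM logs
WHERE user_id = '123456'
  AND ts >= now() - INTERVAL 24 HOUR;
```

Compare rows read before and after adding the index; if the number barely changes, drop the index rather than pay for maintaining it.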
Common Mistakes to Avoid
- Treating partitioning as a universal speed hack:
  - In ClickHouse, partitioning is primarily a data management feature (retention, lifecycle, tiering), not a magic query accelerator.
  - Partitioning by a high-cardinality field (like `user_id`) is almost always a bad idea: you’ll create an enormous number of partitions and tiny parts, which will slow merges and degrade performance.
  - Stick to low-cardinality partition keys such as `toDate(ts)` or a coarse-grained dimension (e.g., region).
- Ingesting logs with tiny, frequent inserts:
  - Small insert batches (tens or hundreds of rows) create too many parts in MergeTree, leading to:
    - High merge CPU
    - “Too many parts” errors
    - Unpredictable query latency
  - Always batch to at least 1,000 rows per insert, ideally 10,000–100,000. Use your log shipper’s buffering and flush policies to enforce this.
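
Both mistakes show up as a growing number of active parts, which you can watch directly. The query below is a minimal sketch using the standard `system.parts` table; the limit of 20 is arbitrary.

```sql
-- Partitions with the most active parts: a high or steadily growing count
-- suggests inserts are too small or the partition key is too fine-grained.
SELECT
    database,
    table,
    partition,
    count()   AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 20;
```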
Real-World Example
When we migrated a high-cardinality log and metrics backend from Elasticsearch to ClickHouse, our biggest pain was investigative queries on fields like user_id, trace_id, and feature_flag—exactly where Elasticsearch’s indices and aggregations were struggling.
In Elasticsearch, query patterns like:
- “Show all 500 errors for user `X` over 7 days”
- “Group requests by `trace_id` for a hot path endpoint”
- “Count unique `session_id` per service per minute”

would either time out or require heavy pre-aggregation and index tuning. Storage had ballooned because each new cardinality dimension added index overhead across multiple shards.
In ClickHouse, we:
- Modeled logs as a wide MergeTree table with `ts`, `service`, `host`, and all high-cardinality dimensions stored as plain columns.
- Batched ingestion aggressively (10,000–100,000 rows) to keep `system.parts` healthy and merges flowing.
- Monitored `system.query_log` to validate that our most common investigative queries were finishing within 200–500 ms, even over billions of rows.
- Used ClickHouse Cloud to offload operational burden: automatic scaling, managed backups, and integrated SQL console for the team.
The result was a 50%+ drop in infrastructure cost for the log analytics backend and sub-second dashboard loads for our SRE teams, similar to what Capital One saw when they cut infrastructure costs by 50% and reduced dashboard load times from 5+ seconds to under 500ms with ClickHouse Cloud.
Pro Tip: Before migrating, capture your top 20 Elasticsearch queries (by frequency and cost) and prototype them in ClickHouse using sample data. Use `system.query_log` to confirm latency, memory, and scanned rows. This gives you a hard, data-backed comparison of “cheaper at scale” instead of just guesswork.
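
One way to pull those numbers out of `system.query_log` is sketched below. It groups finished `SELECT` statements by their normalized form, so the same query with different literals counts as one pattern. The one-day window and the limit are assumptions to adjust.

```sql
-- Slowest query patterns over the last day, with p95 latency, peak memory, and rows read.
SELECT
    normalized_query_hash,
    any(query)                            AS sample_query,
    count()                               AS runs,
    quantile(0.95)(query_duration_ms)     AS p95_ms,
    formatReadableSize(max(memory_usage)) AS peak_memory,
    max(read_rows)                        AS max_rows_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Select'
  AND event_time >= now() - INTERVAL 1 DAY
GROUP BY normalized_query_hash
ORDER BY p95_ms DESC
LIMIT 20;
```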
Summary
If your Elasticsearch cluster is buckling under high-cardinality log analytics—exploding costs, slow dashboards, painful investigations—it’s a sign you’re pushing an index-centric engine into a columnar analytics problem. Alternatives that truly shine here are column-oriented OLAP databases, with ClickHouse at the front of the pack.
ClickHouse’s columnar storage, vectorized execution, and best-in-class compression give you:
- Millisecond queries over billions of log events
- Friendly behavior under extreme cardinality (users, traces, sessions, features)
- Lower storage and compute cost compared to index-heavy search engines
When paired with ClickStack for observability, or deployed as a real-time log warehouse via ClickHouse Cloud or self-managed ClickHouse, it becomes a durable, cost-effective replacement for Elasticsearch in high-cardinality log analytics scenarios.