**Problem**
In our environment, we manage a large-scale dataset (hundreds of millions of documents) within a single Elasticsearch cluster. A critical pain point was the high frequency of unnecessary reindexing operations, triggered primarily by data invalidation signals.
Every time a source data record was updated, an invalidation signal was generated, leading to an attempt to reindex the corresponding Elasticsearch document. With hundreds of millions of documents and a high rate of updates (e.g., millions per day), this put a significant load on our Elasticsearch cluster.
**Core Issue**
The core issue was that not every field in the source data record corresponds to a field we actually index or use for searching in Elasticsearch. For example, a source record might have 50 fields, but only 10 of these fields are mapped to the Elasticsearch index. If one of the remaining 40 non-indexed fields changes, we would still receive an invalidation signal and unnecessarily reindex the document, even though the content exposed via Elasticsearch remained identical.
**Solution**
Even when source data changes, reindexing is skipped if indexed content remains identical, using hash comparison to eliminate unnecessary Elasticsearch updates
To eliminate redundant reindexing, we introduced a pre-indexing check mechanism utilizing Redis and XXHash. The goal was to quickly determine if the fields relevant to the Elasticsearch document had actually changed since the last indexing.
**The Mechanism**
Relevant Data Extraction: When a document is about to be indexed, we first extract only the subset of fields that are actually mapped to the Elasticsearch index.
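Hash Generation: We serialize this subset of relevant fields (e.g., into a canonical JSON format) and calculate a hash of it using XXHash. XXHash was chosen for its high performance and low collision rate, which is crucial for high-throughput operations.
Hash Comparison: We compare the new hash with the hash stored in Redis for that document. If the two match, the indexed content has not changed and reindexing is skipped; otherwise the document is indexed and the stored hash is updated.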
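The extraction and hashing steps can be sketched roughly as follows. This is a minimal illustration, not our production code: the field list `INDEXED_FIELDS` and the helper names `extract_indexed_fields` and `content_hash` are hypothetical, and the 64-bit XXHash variant (`xxh64` from the third-party `xxhash` Python package) is an assumption. If that package is not installed, the sketch falls back to a standard-library hash purely so that it still runs.

```python
import json
from typing import Any, Dict, Mapping, Sequence

try:
    import xxhash  # third-party package (assumed): pip install xxhash

    def _digest(data: bytes) -> str:
        # 64-bit XXHash, returned as 16 hex characters.
        return xxhash.xxh64(data).hexdigest()
except ImportError:
    import hashlib

    def _digest(data: bytes) -> str:
        # Fallback only so the sketch runs without xxhash; not XXHash.
        return hashlib.blake2b(data, digest_size=8).hexdigest()

# Hypothetical subset of source fields that are mapped to the index.
INDEXED_FIELDS = ("id", "title", "price", "tags")


def extract_indexed_fields(
    record: Mapping[str, Any],
    indexed_fields: Sequence[str] = INDEXED_FIELDS,
) -> Dict[str, Any]:
    """Keep only the fields that end up in the Elasticsearch document."""
    return {name: record[name] for name in indexed_fields if name in record}


def content_hash(
    record: Mapping[str, Any],
    indexed_fields: Sequence[str] = INDEXED_FIELDS,
) -> str:
    """Hash a canonical JSON serialization of the indexed fields."""
    subset = extract_indexed_fields(record, indexed_fields)
    # sort_keys and fixed separators make the output independent of
    # key order and whitespace, so equal content gives equal bytes.
    canonical = json.dumps(
        subset, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return _digest(canonical.encode("utf-8"))
```

With this sketch, two versions of a record that differ only in a non-indexed field (for example an internal note) produce the same hash, while a change to `price` produces a different one. Note that list order still matters: if a field such as `tags` is semantically unordered, it would need to be sorted before serialization.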
**Redis Usage**
We currently have 155 million keys in Redis, using tens of gigabytes of memory. On our Redis nodes, BGSAVE snapshots were enabled and repl-diskless-sync was disabled, so both persistence and replica synchronization went through disk. Since we only need the data in memory, not on disk, this setup was causing temporary memory spikes. We changed these settings so that Redis works fully in-memory, which helped prevent the memory peaks.
**Architectural Benefits**
Redis’s Role: Redis provides the necessary speed for this lookup/comparison mechanism. Since the check is performed for every potential reindex, low latency is non-negotiable.
No Data Loss: A key architectural decision was ensuring robustness. If Redis becomes temporarily unavailable, we fall back to the safe behavior and always index. While this temporarily reintroduces unnecessary reindexing, we don’t risk missing a necessary update. Since the hash is always generated from the source of truth (the actual data), Redis only acts as a caching/comparison layer, so nothing is permanently lost if its contents disappear.
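The comparison and the "always index when Redis is unavailable" fallback can be sketched as below. This is an illustration under stated assumptions rather than our actual implementation: the `HashStore` protocol, the `KEY_PREFIX` naming scheme, and the functions `should_index` and `index_if_changed` are hypothetical names. The store only needs `get` and `set` methods, which a redis-py `redis.Redis` client provides; the two small store classes at the end are stand-ins for demonstration.

```python
from typing import Callable, Optional, Protocol, Union


class HashStore(Protocol):
    """Minimal interface the check needs from the store (hypothetical)."""

    def get(self, key: str) -> Optional[Union[bytes, str]]: ...
    def set(self, key: str, value: str) -> object: ...


KEY_PREFIX = "es:hash:"  # hypothetical key naming scheme


def should_index(store: HashStore, doc_id: str, new_hash: str) -> bool:
    """Return False only when the stored hash provably matches."""
    try:
        stored = store.get(KEY_PREFIX + doc_id)
    except Exception:
        # Store unreachable: fail open and index. Real code would likely
        # catch the client's connection/timeout errors specifically.
        return True
    if stored is None:
        return True  # never indexed before, or the key was lost
    if isinstance(stored, bytes):
        stored = stored.decode("utf-8")
    return stored != new_hash


def index_if_changed(
    store: HashStore,
    doc_id: str,
    new_hash: str,
    index_fn: Callable[[str], object],
) -> bool:
    """Index the document unless its content hash is unchanged."""
    if not should_index(store, doc_id, new_hash):
        return False
    index_fn(doc_id)  # if indexing raises, the hash is not recorded
    try:
        store.set(KEY_PREFIX + doc_id, new_hash)
    except Exception:
        pass  # worst case: one redundant reindex on the next signal
    return True


class InMemoryStore:
    """Dictionary-backed stand-in for a Redis client."""

    def __init__(self) -> None:
        self.data: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        self.data[key] = value


class UnavailableStore:
    """Stand-in that simulates Redis being unreachable."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("store is down")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("store is down")
```

The hash is written only after indexing succeeds, so a failed index call is retried on the next signal instead of being masked by a matching hash. With `UnavailableStore`, every call indexes, which matches the fallback described above.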
**Improvements**
The implementation of the hash-based check delivered significant improvements across our indexing pipeline and Elasticsearch cluster performance.
CPU Usage Reduction: By skipping the entire document transformation and network communication phases for redundant updates, the CPU load dropped by more than half. This allowed us to scale down resources and improve cost efficiency.
Indexing Rate: By eliminating unnecessary reindexes, which constituted over 70% of our previous invalidation signals, we significantly freed up critical capacity. This allows us to handle genuine data updates much faster.
Search Latency: Lower indexing volume led to fewer segment merges in Elasticsearch. Since segment merging is CPU-intensive, this reduction translates into lower CPU usage and ultimately better search latency.
Segment Count: With fewer indexing operations, the number of segments created in Elasticsearch decreased noticeably. This reduced merge pressure and contributed to a more stable and efficient cluster.
**Acknowledgements**
We would like to thank the Search Core team for their support, insights, and collaboration throughout this work. This improvement would not have been possible without their contributions.