Support for criteria based DWPT selection inside DocumentWriter #13387
Comments
I like this idea! I hope we can find a simple enough API exposed through IWC to enable the optional grouping. This also has nice mechanical sympathy / symmetry with the distributed search engine analog. A distributed search engine like OpenSearch indexes and searches into N shards across multiple servers, and this is nearly precisely the same logical problem that Lucene tackles on a single multi-core server when indexing and searching into N segments, especially as Lucene's intra-query concurrency becomes the norm/default and improves (e.g. allowing intra-segment per query concurrency as well). We should cross-fertilize more based on this analogy: the two problems are nearly the same. A shard, a segment, same thing heh (nearly). So this proposal is bringing custom document routing feature from OpenSearch, down into Lucene's segments. |
This is an interesting idea! You do not mention it explicitly in the issue description, but presumably this only makes sense if an index sort is configured, otherwise merges may break the clustering that you are trying to create in the first place?
I'm a bit uncomfortable with this approach. It is so heavy that it wouldn't perform much better than maintaining a separate |
Thanks Mike and Adrian for the feedback.
Not exactly. As mentioned, in order to ensure that grouping criteria invariant is maintained even during segment merges, we are introducing a new merge policy that acts as a decorator over the existing Tiered Merge policy. During a segment merge, this policy would categorize segments according to their grouping function outcomes before merging segments within the same category, thus maintaining the grouping criteria’s integrity throughout the merge process.
I believe that even if we use a single DWPT pool with rendezvous hashing to distribute DWPTs, we would end up creating the same number of DWPTs as with separate DWPT pools per group. Consider an example where we group logs by status code for an index and 8 concurrent indexing threads are indexing 2xx status code logs. This creates 8 DWPTs. Now 4 threads start indexing 4xx status code logs concurrently; this requires 4 extra DWPTs if we want to maintain status-code-based grouping. Instead of creating new DWPTs, we could try reusing 4 of the existing DWPTs created for 2xx logs on a best-effort basis, but this would again mix 4xx logs with 2xx logs, defeating the purpose of the grouping. Also, to keep the number of DWPTs in check, we will add guardrails on the number of groups that the grouping function can generate. Let me know if my understanding is correct.
Thanks for explaining. The concern I have given how we're planning on never flushing/merging segments from the same group is that this would essentially perform the same as maintaining one To get similar benefits from clustering but without incurring the overhead of segments, I feel like we should rather improve our support for clustering at the doc ID level, ie. index sorting. And maybe ideas like this criteria-based selection of DWPTs could help speed up the creation of sorted indexes? |
Thanks for the suggestion. Clustering within the segment, as suggested above, does improve skipping of documents (especially when combined with the BKD optimisation to skip non-competitive documents). But it still prevents us from building several optimisations that separate DWPT pools for different groups would enable:
Actually, we won't be able to build multiple optimizations on top of the segment topology if we store them together. Let me know if this makes sense. |
I agree that better organizing data across segments yields significant benefits, I'm only advocating for doing this by maintaining a separate |
Sorry, I missed answering this part in my earlier response. We did explore the approach of creating an IndexWriter/Lucene index (or OpenSearch shard) for each group. However, implementing it would lead to significant overhead on the client side (such as OpenSearch), both in terms of code changes and operational overhead like metadata management. On the other hand, maintaining separate DWPT pools for different groups would require minimal changes inside Lucene. The overhead would be lower here, as the Lucene shard would still be maintained as a single physical unit. Let me know if this makes sense.
Attaching a preliminary PR for the POC related to above issue to share my understanding. Please note that this is not the final PR. |
Can you give more details? The main difference that comes to mind is that using multiple |
I like @jpountz's idea of just using separate The idea of using a single underlying multi-tenant You would also need a clean-ish way to manage a single total allowed RAM bytes across the N Searching across the N separate shards as if they were a single index is also possible via |
I don't think we do. +1 to exploring this separately. I like that we then wouldn't need to tune the merge policy because it would naturally only see segments that belong to its group.
Right,
Indeed, I'd expect it to work just fine. |
Thanks a lot for the suggestions, @jpountz and @mikemccand. As suggested above, we worked on a POC to explore using a separate IndexWriter for each group. Each IndexWriter is associated with a distinct logical filter directory, which attaches a filename prefix according to the group. These directories are backed by a single multi-tenant directory. However, this approach presents several challenges on the client (OpenSearch) side. Each IndexWriter now generates its own sequence numbers. In a service like OpenSearch, the Translog operates on sequence numbers at the Lucene index level, so when the same sequence number is generated by different IndexWriters for the same Lucene index, conflicts can occur during operations like Translog replay. Additionally, the local and global checkpoints maintained during recovery in a service like OpenSearch require sequence numbers to increase continuously, which would no longer hold with multiple IndexWriters. We did not face these issues when different groups were represented by different DWPT pools, because a single IndexWriter was writing to the Lucene index and generating continuously increasing sequence numbers. The complexity of handling different segments for different groups is then managed internally at the Lucene level rather than propagated to the client. Feel free to share any further suggestions.
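To make the sequence number conflict described above concrete, here is a toy Java model. `SeqNoDemo` and `ToyWriter` are hypothetical names used only for illustration; they are not Lucene or OpenSearch classes, and the real IndexWriter assigns sequence numbers through its own internal machinery.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model only: shows why independent per-writer counters collide,
// while a single shared counter yields unique, increasing numbers.
final class SeqNoDemo {
  /** Toy writer that stamps each operation with the next value of its counter. */
  static final class ToyWriter {
    private final AtomicLong seqNo;

    ToyWriter(AtomicLong seqNo) {
      this.seqNo = seqNo;
    }

    /** Pretends to index a document and returns the sequence number assigned to it. */
    long addDocument() {
      return seqNo.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    // One counter per group-level writer: both writers hand out the same numbers.
    ToyWriter w2xx = new ToyWriter(new AtomicLong());
    ToyWriter w5xx = new ToyWriter(new AtomicLong());
    System.out.println("independent: " + w2xx.addDocument() + " and " + w5xx.addDocument());

    // One counter shared by both writers: numbers are unique across groups.
    AtomicLong shared = new AtomicLong();
    ToyWriter a = new ToyWriter(shared);
    ToyWriter b = new ToyWriter(shared);
    System.out.println("shared: " + a.addDocument() + " and " + b.addDocument());
  }
}
```

With a single IndexWriter and per-group DWPT pools, the writer plays the role of the shared counter, which is why the conflict did not appear in that setup.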
This would indeed get somewhat tricky. But is OpenSearch really using Lucene's returned sequence numbers? I had thought Elasticsearch's sequence number implementation predated the Lucene change adding sequence numbers to every low-level Lucene operation that mutates the index. Under the hood, |
I wonder if we can leverage IndexWriter's This could mean that each shard for an OpenSearch/Elasticsearch index would maintain internal indexes for each desired category, and use the API to combine them into a common "shard" index at every flush? We'd still need a way to maintain category labels for a segment during merging, but that's a common problem for any approach we take. |
Thanks mikemccand and vigyasharma for suggestions. Evaluated different approaches to use different IndexWriter for different groups:
Approach 1: Using filter directory for each group
In this approach, each group (for above example grouping criteria is status code) has its own To address the sequence number conflict between different
Pros
Cons
|
Approach 2: Using a physical directory for each group
To segregate segments belonging to different groups and avoid attaching a prefix to segment names, we associated group-level IndexWriters with a physical directory instead of a filter directory.
Pros
Cons
|
Approach 3: Combining group level IndexWriter with addIndexes
In this approach, in order to make multiple group-level IndexWriters function as a unified entity, we use Lucene’s addIndexes API to combine them. This ensures that the top-level IndexWriter shares a common
Pros
Cons
|
Summary
In summary, the problem can be broken down into three sub-problems.
With the different approaches we investigated, none of them satisfies/solves the above 3 sub problems cleanly with decent complexity. That leaves us with the originally suggested approach of using different DWPTs to represent different groups. The original approach:
Exploring
In parallel, we are still exploring if we can introduce an API for Open for thoughts and suggestions. |
How do background index merges work with the original, separate DWPT based approach? Don't you need to ensure that you only merge segments that belong to a single group? |
We will introduce a new merge policy in this case as well, to ensure the grouping criteria invariant is maintained even during segment merges. The originally proposed changes were on the DWPT side, together with a new merge policy that ensures only segments of the same group are merged.
Description
Issue
Today, Lucene internally creates multiple DocumentsWriterPerThread (DWPT) instances per index to facilitate concurrent indexing across different ingestion threads. Documents indexed by the same DWPT end up in the same segment after a flush. Because DWPT assignment is based only on concurrency, it is not possible to predict or control how documents are distributed across segments. For instance, when indexing time series logs, it is possible for a single DWPT to index logs with both 5xx and 2xx status codes, leading to segments that contain a heterogeneous mix of documents.
Typically, in scenarios like log analytics, users are more interested in a certain subset of the data, such as logs for error (4xx) and/or fault (5xx) requests. Assigning DWPTs to documents without regard to their content can disperse these relevant documents across multiple segments. Furthermore, if these documents are sparse, they will be thinly spread out even within a segment, so search queries must iterate over many less relevant documents. While the optimisation that lets collectors use the BKD tree to skip non-competitive documents significantly improves query performance, the actual number of documents iterated still depends on how data is arranged in the segment and how the underlying BKD tree gets constructed.
Storing relevant log documents separately from relatively less relevant ones, such as 2xx logs, can prevent their scattering across multiple segments. This model can markedly enhance query performance by letting searches touch fewer segments and skip documents that are less relevant. Moreover, clustering related data allows aggregations for frequently executed queries (e.g., count, minimum, maximum) to be pre-computed and stored as separate metadata. The corresponding queries can then be served from the metadata itself, reducing both latency and compute.
Proposal
In this proposal, we suggest adding support for a criteria-based DWPT selection mechanism within DocumentsWriter. Users can define the criteria through a grouping function, supplied as a new IndexWriterConfig setting. The grouping criteria can be based on the anticipated query pattern of the workload, so that frequently queried data is stored together. During indexing, this function would be evaluated for each document, ensuring that documents with differing outcomes are indexed using separate DWPTs. For instance, for HTTP request logs, the grouping function could assign DWPTs according to the status code in the log entry.
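As a rough illustration of such a grouping function for HTTP request logs, consider the sketch below. The proposal does not specify the API that would be exposed through IndexWriterConfig, so `StatusCodeGrouping`, `groupOf`, `BY_STATUS` and the field-map document model are all hypothetical.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: not part of Lucene's API.
final class StatusCodeGrouping {
  /** Maps an HTTP status code to a coarse group key such as "2xx" or "5xx". */
  static String groupOf(int statusCode) {
    if (statusCode < 100 || statusCode > 599) {
      return "other"; // outside the valid HTTP status range
    }
    return (statusCode / 100) + "xx";
  }

  /** Grouping function over a simplified document, modeled here as a field map. */
  static final Function<Map<String, Object>, String> BY_STATUS =
      doc -> {
        Object status = doc.get("status");
        // Documents without an integer status field fall into the catch-all group.
        return status instanceof Integer ? groupOf((Integer) status) : "other";
      };
}
```

Keeping the function pure and cheap matters in a design like this, since it would be evaluated once per indexed document.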
Associated OpenSearch RFC
opensearch-project/OpenSearch#13183
Improvements with new DWPT distribution strategy
We worked on a POC in Lucene and tried integrating it with OpenSearch. We validated DWPT distribution based on different criteria, such as status code and timestamp, against different types of workload. We observed a 50% to 60% improvement in the performance of range, aggregation and sort queries with the proposed DWPT selection approach.
Implementation Details
The user-defined grouping function will be passed to DocumentsWriter as a new IndexWriterConfig setting. When indexing a document, DocumentsWriter will evaluate this function and pass the outcome to DocumentsWriterFlushControl and DocumentsWriterPerThreadPool when requesting a DWPT for the document. DocumentsWriterPerThreadPool will maintain a distinct pool of DWPTs for each possible outcome, and the pool selected for a document depends on that document's outcome for the grouping function. Should the relevant pool be empty, a new DWPT will be created and added to it. In the HTTP request log example above, distinct pools for 2xx and 5xx status codes ensure that 2xx logs are indexed by a separate set of DWPTs from 5xx logs. Once a DWPT is designated for flushing, it is checked out of the pool and won't be reused for indexing.
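The pool-per-outcome behaviour described above can be sketched roughly as follows. This is a simplified stand-in, not the POC's code: `GroupedWriterPool` and its methods are hypothetical, and Lucene's real DWPT pool has a different API and locking model.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for per-group DWPT pools; W plays the role of a DWPT.
final class GroupedWriterPool<W> {
  private final Map<String, Deque<W>> freeByGroup = new HashMap<>();
  private final Supplier<W> factory;

  GroupedWriterPool(Supplier<W> factory) {
    this.factory = factory;
  }

  /** Returns an idle writer for the group, creating a new one if its pool is empty. */
  synchronized W acquire(String group) {
    Deque<W> free = freeByGroup.computeIfAbsent(group, g -> new ArrayDeque<>());
    W writer = free.pollFirst();
    return writer != null ? writer : factory.get();
  }

  /** Returns a writer to its group's pool so later documents of that group can reuse it. */
  synchronized void release(String group, W writer) {
    freeByGroup.computeIfAbsent(group, g -> new ArrayDeque<>()).addFirst(writer);
  }

  /** Number of idle writers currently pooled for the group. */
  synchronized int idleCount(String group) {
    Deque<W> free = freeByGroup.get(group);
    return free == null ? 0 : free.size();
  }
}
```

In this model, a writer that has been picked for flushing is simply never passed back to `release`, which mirrors the "checked out and not reused" rule in the description.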
Further, in order to ensure that grouping criteria invariant is maintained even during segment merges, we propose a new merge policy that acts as a decorator over the existing Tiered Merge policy. During a segment merge, this policy would categorize segments according to their grouping function outcomes before merging segments within the same category, thus maintaining the grouping criteria’s integrity throughout the merge process.
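The decorator idea can be modeled as "partition by group, then delegate". The sketch below deliberately avoids Lucene's MergePolicy API; `GroupAwareMergeSelector`, `Segment` and `mergeWholePartition` are hypothetical names, and the toy delegate stands in for whatever selection the wrapped tiered policy would actually make.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of the decorator idea. It does not use Lucene's MergePolicy API.
final class GroupAwareMergeSelector {
  /** Simplified segment: a name plus the group its documents belong to. */
  static final class Segment {
    final String name;
    final String group;

    Segment(String name, String group) {
      this.name = name;
      this.group = group;
    }
  }

  /**
   * Partitions segments by group, then lets the delegate pick merges inside each
   * partition, so no proposed merge ever mixes segments from different groups.
   */
  static List<List<Segment>> selectMerges(
      List<Segment> segments, Function<List<Segment>, List<List<Segment>>> delegate) {
    Map<String, List<Segment>> byGroup = new LinkedHashMap<>();
    for (Segment s : segments) {
      byGroup.computeIfAbsent(s.group, g -> new ArrayList<>()).add(s);
    }
    List<List<Segment>> merges = new ArrayList<>();
    for (List<Segment> sameGroup : byGroup.values()) {
      merges.addAll(delegate.apply(sameGroup));
    }
    return merges;
  }

  /** Toy delegate: merge a whole partition whenever it holds two or more segments. */
  static List<List<Segment>> mergeWholePartition(List<Segment> sameGroup) {
    List<List<Segment>> merges = new ArrayList<>();
    if (sameGroup.size() >= 2) {
      merges.add(new ArrayList<>(sameGroup));
    }
    return merges;
  }
}
```

Because the delegate only ever sees one group's segments at a time, it needs no knowledge of grouping itself, which is the appeal of the decorator approach.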
Guardrails
To manage the system’s resources effectively, guardrails will limit the number of groups that the grouping function can generate. Users will need to provide a predefined list of acceptable outcomes for the grouping function, along with the function itself. Documents whose grouping function outcome is not in this list will be indexed using a default pool of DWPTs. This limits the number of DWPTs created during indexing, preventing the formation of numerous small segments that could lead to frequent segment merges. Additionally, a cap on the DWPT count keeps JVM utilization and garbage collection in check.
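A minimal sketch of this guardrail, assuming the grouping function returns a string key: outcomes outside the allowed list are routed to a default group. `GuardedGrouping`, `DEFAULT_GROUP` and `maxPools` are hypothetical names, not part of the proposal's API.

```java
import java.util.Set;
import java.util.function.Function;

// Hypothetical guardrail wrapper around a user-supplied grouping function.
final class GuardedGrouping<D> {
  static final String DEFAULT_GROUP = "_default"; // assumed name for the fallback pool

  private final Function<D, String> groupingFunction;
  private final Set<String> allowedGroups;

  GuardedGrouping(Function<D, String> groupingFunction, Set<String> allowedGroups) {
    this.groupingFunction = groupingFunction;
    this.allowedGroups = Set.copyOf(allowedGroups); // defensive, immutable copy
  }

  /** Returns the document's group if it is allowed, otherwise the default group. */
  String resolve(D doc) {
    String group = groupingFunction.apply(doc);
    return group != null && allowedGroups.contains(group) ? group : DEFAULT_GROUP;
  }

  /** Upper bound on the number of distinct pools: one per allowed group plus the default. */
  int maxPools() {
    return allowedGroups.size() + 1;
  }
}
```

Bounding the set of outcomes up front is what lets the number of pools, and therefore the number of concurrently open DWPTs per group, stay predictable.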