Fix issue of red index on close for remote enabled clusters #15990

Merged
merged 3 commits into from
Sep 25, 2024

Member

@ashking94 ashking94 commented Sep 19, 2024

Description

The close index operation involves the following steps:

  1. Start closing indices by adding a write block
  2. Wait for the operations on the shards to be completed
    1. Acquire all indexing operation permits to ensure that all operations have completed indexing
  3. After acquiring all indexing permits, closing an index involves 2 phases:
    1. Sync translog
    2. Flush Index Shard
  4. Move index states from OPEN to CLOSE in cluster state for indices that are ready for closing
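
Steps 1 and 2 above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: the class `IndexingPermitsSketch`, its methods, and the permit count are hypothetical and do not reproduce OpenSearch's actual permit handling. It shows why acquiring all permits implies that every in-flight indexing operation has finished.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of steps 1 and 2; not OpenSearch's actual permit handling.
final class IndexingPermitsSketch {
    static final int TOTAL_PERMITS = 16; // illustrative value only

    private final Semaphore permits = new Semaphore(TOTAL_PERMITS);
    private volatile boolean writeBlocked = false;

    // An indexing operation holds one permit while it runs.
    boolean tryStartIndexing() {
        return !writeBlocked && permits.tryAcquire();
    }

    void finishIndexing() {
        permits.release();
    }

    // Step 1: reject new writes. Step 2: take every permit, which only
    // succeeds once all in-flight operations have released theirs.
    boolean blockWritesAndDrain(long timeoutMillis) {
        writeBlocked = true;
        try {
            return permits.tryAcquire(TOTAL_PERMITS, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In this model, steps 3 and 4 (sync translog, flush, then the state change to CLOSE) would only run after `blockWritesAndDrain` returns `true`.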

During a successful (happy-path) index close, the translog is uploaded twice:

  • 1st time, as part of the 3.a. Sync Translog step, the indexing operations are uploaded.
  • 2nd time, as part of the 3.b. Flush Index Shard step, the latest global checkpoint (GCP) is uploaded.

However, if a flush happens after an operation has landed in the Lucene buffer but before the buffered sync (for a sync translog) or the periodic async sync (for an async translog), steps 3.a and 3.b both become no-ops, and the GCP in the uploaded checkpoint file is the one from the last translog sync. This leaves maxSeqNo and the GCP out of step, which causes an exception while creating the ReadOnlyEngine and turns the index red.
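
The race can be illustrated with a deliberately simplified model. Everything in it is an assumption made for illustration: the class `StaleCheckpointSketch`, its fields, and the rule that a sync or flush uploads whatever global checkpoint is known at that moment are hypothetical and are not OpenSearch's implementation.

```java
// Hypothetical, simplified model of the race described above; not OpenSearch code.
final class StaleCheckpointSketch {
    long maxSeqNo = -1;                 // highest sequence number indexed
    long globalCheckpoint = -1;         // in-memory global checkpoint
    long uploadedGlobalCheckpoint = -1; // GCP in the last checkpoint file uploaded to the remote store
    boolean unsyncedOps = false;        // operations not yet synced to the translog
    boolean uncommittedOps = false;     // operations not yet flushed to a commit

    void index() {
        maxSeqNo++;
        unsyncedOps = true;
        uncommittedOps = true;
    }

    void advanceGlobalCheckpoint() {
        globalCheckpoint = maxSeqNo;
    }

    // Assumption: an upload records the global checkpoint known at that moment.
    private void upload() {
        uploadedGlobalCheckpoint = globalCheckpoint;
        unsyncedOps = false;
    }

    // Assumed pre-fix behaviour: sync only when there are unsynced operations.
    boolean syncTranslog() {
        if (!unsyncedOps) return false; // no-op
        upload();
        return true;
    }

    // Flush does work (and uploads) only if something is uncommitted.
    boolean flush() {
        if (!uncommittedOps) return false; // no-op
        uncommittedOps = false;
        upload();
        return true;
    }

    // Stand-in for the consistency check made when the closed index is opened read-only.
    boolean consistentForReadOnly() {
        return maxSeqNo == uploadedGlobalCheckpoint;
    }
}
```

In this model, calling `index()`, `flush()`, `advanceGlobalCheckpoint()` and then the close-time `syncTranslog()` and `flush()` leaves `uploadedGlobalCheckpoint` at -1 while `maxSeqNo` is 0, because both close-time calls are no-ops. In the happy path (`index()`, `syncTranslog()`, `advanceGlobalCheckpoint()`, `flush()`), the second upload carries the latest GCP and the check passes.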

This PR tracks the global checkpoint that was last uploaded as part of a successful translog upload to the remote store. The RemoteFsTranslog.syncNeeded() method now also compares this tracked global checkpoint against the current translog writer's last synced global checkpoint, so a sync is reported as needed when the two differ.
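
The idea behind the change can be sketched as follows. This is not the code from the PR: the class `TrackedCheckpointSketch` and all of its members are hypothetical names, and the pre-fix condition is reduced to a single "unsynced operations" flag for brevity.

```java
// Hypothetical sketch of the idea behind the fix; names are illustrative, not the PR's code.
final class TrackedCheckpointSketch {
    private long writerLastSyncedGlobalCheckpoint = -1; // GCP the translog writer last synced
    private long uploadedGlobalCheckpoint = -1;         // tracked GCP of the last successful remote upload
    private boolean unsyncedOps = false;

    void addOperation() {
        unsyncedOps = true;
    }

    // The writer synced locally and recorded a (possibly newer) global checkpoint.
    void recordWriterSync(long globalCheckpoint) {
        writerLastSyncedGlobalCheckpoint = globalCheckpoint;
    }

    // Called only after an upload to the remote store succeeded.
    void onUploadSuccess(long globalCheckpoint) {
        uploadedGlobalCheckpoint = globalCheckpoint;
        unsyncedOps = false;
    }

    // Assumed pre-fix condition: only unsynced operations trigger a sync.
    boolean syncNeededBeforeFix() {
        return unsyncedOps;
    }

    // Sketch of the post-fix condition: also sync when the remote copy holds a
    // different global checkpoint than the writer's last synced one.
    boolean syncNeeded() {
        return unsyncedOps || uploadedGlobalCheckpoint != writerLastSyncedGlobalCheckpoint;
    }
}
```

With this extra comparison, the close-time sync in the problematic scenario is no longer a no-op in the model: the stale remote checkpoint alone is enough to trigger another upload.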

Related Issues

Resolves #15989

Check List

  • [x] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ashking94 ashking94 added backport 2.x Backport to 2.x branch skip-changelog labels Sep 19, 2024
Contributor

❌ Gradle check result for a1d5a87: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94
Member Author

> ❌ Gradle check result for a1d5a87: FAILURE
>
> Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

I have added tests for the edge case described in the referenced issue. I added the tests first and the main code changes afterwards:

[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog/)
[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog_2/)
[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog_3/)
[org.opensearch.remotestore.RemoteStoreIT.testCloseIndexWithNoOpSyncAndFlushForSyncTranslog](https://build.ci.opensearch.org/job/gradle-check/48058/testReport/junit/org.opensearch.remotestore/RemoteStoreIT/testCloseIndexWithNoOpSyncAndFlushForSyncTranslog_4/)

Contributor

❌ Gradle check result for ac864f0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@ashking94 ashking94 changed the title from "[Remote Store] Emit correct global checkpoint during translog upload" to "Fix issue of red index on close for remote enabled clusters" Sep 19, 2024
@ashking94 ashking94 marked this pull request as ready for review September 19, 2024 10:06
Collaborator

@Bukhtawar Bukhtawar left a comment

Wondering: is this a problem with just the remote translog, or with the local translog as well?

@ashking94
Member Author

ashking94 commented Sep 23, 2024

> Wondering: is this a problem with just the remote translog, or with the local translog as well?

The problem seems to exist for the remote translog only; the local version seems fine. When we close the index (in the case of a remote translog), the translog is wiped out locally first and then rehydrated from the remote store. At that point, the most recent checkpoint file downloaded has a global checkpoint from the second-to-last translog sync.

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

✅ Gradle check result for 27b6828: SUCCESS

Collaborator

@gbbafna gbbafna left a comment

Changes look great.

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

❌ Gradle check result for d32dea3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Contributor

❕ Gradle check result for 94f998e: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@ashking94 ashking94 merged commit f1acc7a into opensearch-project:main Sep 25, 2024
33 of 34 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-15990-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 f1acc7aad7db4c3c9ce2e0ac331b02105ddc85f5
# Push it to GitHub
git push --set-upstream origin backport/backport-15990-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-15990-to-2.x.

@ashking94 ashking94 added backport 2.x Backport to 2.x branch and removed backport 2.x Backport to 2.x branch labels Sep 25, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 25, 2024
* Fix red index on close for remote translog

Signed-off-by: Ashish Singh <ssashish@amazon.com>

* Add UTs

Signed-off-by: Ashish Singh <ssashish@amazon.com>

---------

Signed-off-by: Ashish Singh <ssashish@amazon.com>
(cherry picked from commit f1acc7a)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Labels
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[BUG] On remote store enabled cluster closing index sometimes makes the index red
3 participants