[BUG] On remote store enabled cluster closing index sometimes makes the index red #15989

Closed
ashking94 opened this issue Sep 19, 2024 · 0 comments · Fixed by #15990
Labels: bug, Storage:Remote

ashking94 commented Sep 19, 2024

Issue Summary

I have seen multiple clusters now that have turned red during a close. Closing an index involves closing the IndexShard and reopening it with a ReadOnlyEngine. The ReadOnlyEngine has an extra validation that checks that the maxSeqNo (which is obtained from the segments) is equal to the global checkpoint, a.k.a. GCP (which is obtained from translog.ckp). In the cases where the closed index turns red, I am seeing exceptions showing that the maxSeqNo is greater than the GCP.

Below is the check that leads to the engine failure:

private static SeqNoStats buildSeqNoStats(EngineConfig config, SegmentInfos infos) {
    // maxSeqNo and localCheckpoint are read from the last Lucene commit's user data,
    // while the global checkpoint comes from the supplier (backed by translog.ckp).
    final SequenceNumbers.CommitInfo seqNoStats = SequenceNumbers.loadSeqNoInfoFromLuceneCommit(infos.userData.entrySet());
    long maxSeqNo = seqNoStats.maxSeqNo;
    long localCheckpoint = seqNoStats.localCheckpoint;
    return new SeqNoStats(maxSeqNo, localCheckpoint, config.getGlobalCheckpointSupplier().getAsLong());
}

protected void ensureMaxSeqNoEqualsToGlobalCheckpoint(final SeqNoStats seqNoStats) {
    if (requireCompleteHistory == false) {
        return;
    }
    // Before 8.0 the global checkpoint is not known and up to date when the engine is created after
    // peer recovery, so we only check the max seq no / global checkpoint coherency when the global
    // checkpoint is different from the unassigned sequence number value.
    // In addition to that we only execute the check if the index the engine belongs to has been
    // created after the refactoring of the Close Index API and its TransportVerifyShardBeforeCloseAction
    // that guarantee that all operations have been flushed to Lucene.
    final Version indexVersionCreated = engineConfig.getIndexSettings().getIndexVersionCreated();
    if (indexVersionCreated.onOrAfter(LegacyESVersion.V_7_2_0)
        || (seqNoStats.getGlobalCheckpoint() != SequenceNumbers.UNASSIGNED_SEQ_NO)) {
        assert assertMaxSeqNoEqualsToGlobalCheckpoint(seqNoStats.getMaxSeqNo(), seqNoStats.getGlobalCheckpoint());
        if (seqNoStats.getMaxSeqNo() != seqNoStats.getGlobalCheckpoint()) {
            throw new IllegalStateException(
                "Maximum sequence number ["
                    + seqNoStats.getMaxSeqNo()
                    + "] from last commit does not match global checkpoint ["
                    + seqNoStats.getGlobalCheckpoint()
                    + "]"
            );
        }
    }
}

Root cause

The close index operation involves the following steps -

  1. Start closing indices by adding a write block
  2. Wait for the operations on the shards to be completed
    1. Acquire all indexing operation permits to ensure that all operations have completed indexing
  3. After acquiring all indexing permits, closing an index involves 2 phases -
    1. Sync translog
    2. Flush Index Shard
  4. Move index states from OPEN to CLOSE in cluster state for indices that are ready for closing

During a happy index close, we upload the translog twice -

  • The 1st time, as part of step 3.a (Sync Translog), the indexing operations are uploaded.
  • The 2nd time, as part of step 3.b (Flush Index Shard), the latest GCP is uploaded.

However, if a flush happens after the operation has landed in the Lucene buffer but before the buffered sync (for sync translog) or the periodic async sync (for async translog), then steps 3(a) and 3(b) become no-ops, and the GCP uploaded in the checkpoint file remains the one from the last translog sync. This causes a discrepancy between the maxSeqNo and the GCP, which raises an exception while creating the ReadOnlyEngine and leads to a red index.
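
A condensed sketch of the problematic sequence, using the same index name and the 10s buffer interval from the reproduction steps below (the comments describe the assumed timeline, not actual output):

# Sketch only: index1 must already exist (see step 3 of the reproduction).
curl -s -XPOST "localhost:9200/index1/_doc" -H 'Content-Type: application/json' -d '{"name1":"Ashish"}'
# A flush lands within the 10s buffer interval, before the buffered translog sync,
# so the Lucene commit advances maxSeqNo while translog.ckp still carries the old GCP.
curl -s -XPOST "localhost:9200/index1/_flush"
# Steps 3(a) and 3(b) of the close are then no-ops and the ReadOnlyEngine check fails.
curl -s -XPOST "localhost:9200/index1/_close"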

Related component

Storage:Remote

To Reproduce

We can reproduce this issue for both sync and async translog quite easily -

Sync translog

Step 1 - Set the final setting indices.memory.shard_inactive_time to 10h
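
Since this is a final (non-dynamic) setting, it is assumed here that it has to go into opensearch.yml before the node starts, for example:

# Assumption: config/opensearch.yml is the node's config file; adjust the path for your setup.
echo 'indices.memory.shard_inactive_time: 10h' >> config/opensearch.yml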

Step 2 -

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{                           
  "persistent": {
    "cluster.remote_store.translog.buffer_interval": "10s"
  }
}
'

Step 3 -

for i in {1..1}; do
  echo $i
  curl -X PUT "localhost:9200/index${i}?pretty" -H 'Content-Type: application/json' -d'
  {
    "settings": {
      "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }
  }
  '
done

Step 4 -

for j in {1..1}; do
  for i in {1..1}; do
    echo $i
        curl --location --request POST "localhost:9200/index${i}/_doc?pretty"  \
        --header 'Content-Type: application/json' \
        --data-raw "{
          \"name1\":\"Ashish\"
        }"
  done
done

Step 5 (this should be run immediately after step 4) -

curl localhost:9201/*/_flush

Step 6 (this should be run immediately after step 5) -

curl -XPOST localhost:9200/index1/_close

You should see an exception stack trace like the one below -

opensearch-master1  | [2024-09-18T10:54:33,058][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] failing shard [failed shard, shard [index1][0], node[A8Yi-Gj_RUGAdYdtQviB6w], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=K0EWkYOySu6Q44PI2dkXEg], unassigned_info[[reason=ALLOCATION_FAILED], at[2024-09-18T10:54:32.935Z], failed_attempts[4], failed_nodes[[A8Yi-Gj_RUGAdYdtQviB6w]], delayed=false, details[failed shard on node [A8Yi-Gj_RUGAdYdtQviB6w]: failed recovery, failure RecoveryFailedException[[index1][0]: Recovery failed on {opensearch-node1}{A8Yi-Gj_RUGAdYdtQviB6w}{JsaEQ1-KTkWXZb9P_o8g-w}{172.18.0.4}{172.18.0.4:9300}{dir}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [7] from last commit does not match global checkpoint [6]]; ], allocation_status[no_valid_shard_copy]], message [failed recovery], failure [RecoveryFailedException[[index1][0]: Recovery failed on {opensearch-node1}{A8Yi-Gj_RUGAdYdtQviB6w}{JsaEQ1-KTkWXZb9P_o8g-w}{172.18.0.4}{172.18.0.4:9300}{dir}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IllegalStateException[Maximum sequence number [7] from last commit does not match global checkpoint [6]]; ], markAsStale [true]]
opensearch-master1  | org.opensearch.indices.recovery.RecoveryFailedException: [index1][0]: Recovery failed on {opensearch-node1}{A8Yi-Gj_RUGAdYdtQviB6w}{JsaEQ1-KTkWXZb9P_o8g-w}{172.18.0.4}{172.18.0.4:9300}{dir}{shard_indexing_pressure_enabled=true}
opensearch-master1  |     at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$33(IndexShard.java:3893) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:627) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:124) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2902) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-master1  |     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-master1  |     at java.lang.Thread.run(Thread.java:1583) [?:?]
opensearch-master1  | Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
opensearch-master1  |     ... 11 more
opensearch-master1  | Caused by: java.lang.IllegalStateException: Maximum sequence number [7] from last commit does not match global checkpoint [6]
opensearch-master1  |     at org.opensearch.index.engine.ReadOnlyEngine.ensureMaxSeqNoEqualsToGlobalCheckpoint(ReadOnlyEngine.java:201) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:143) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.engine.NoOpEngine.<init>(NoOpEngine.java:77) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2612) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2519) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2491) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:751) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:126) ~[opensearch-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-2.18.0-SNAPSHOT.jar:2.18.0-SNAPSHOT]
opensearch-master1  |     ... 8 more
opensearch-master1  | [2024-09-18T10:54:33,059][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-18T10:54:33,061][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1  | [2024-09-18T10:54:33,072][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-18T10:54:33,073][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1  | [2024-09-18T10:54:33,084][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
opensearch-master1  | [2024-09-18T10:54:33,084][INFO ][o.o.g.G.RemotePersistedState] [opensearch-master1] codec version is 4
opensearch-master1  | [2024-09-18T10:57:20,901][INFO ][o.o.g.r.RemoteClusterStateCleanupManager] [opensearch-master1] Cleaning up stale remote state files for cluster [opensearch-cluster] with uuid [lP5vHY8eTCmZxR3EVb5wGQ]. Last clean was done before 20 updates

Async translog

This can be reproduced similarly for async translog as well.
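
For example, the index in step 3 can be created with async translog durability instead; the settings below are an assumed async-translog setup rather than the exact configuration of the affected cluster:

curl -X PUT "localhost:9200/index1?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "translog.durability": "async",
      "translog.sync_interval": "10s"
    }
  }
}
'

The remaining steps (index a document, flush immediately, then close) stay the same; the race window is now the periodic async sync interval rather than the buffered sync.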

Expected behavior

In the happy flow, closing the index works normally. In the edge case I described above, the close should also work normally and must not lead to a red cluster.
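
For instance, after step 6 the closed index should stay non-red, which can be verified with something like:

# The closed index should not report red health after the close completes.
curl "localhost:9200/_cluster/health/index1?level=indices&pretty"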

Additional Details

NA
