
[BUG] failing IT test : SegmentReplicationRelocationIT #6065

Closed
ashking94 opened this issue Jan 30, 2023 · 8 comments · Fixed by #6366
Labels: bug (Something isn't working), distributed framework

@ashking94
Member

Describe the bug
https://build.ci.opensearch.org/job/gradle-check/10147/


ashking94 added the bug (Something isn't working) and untriaged labels on Jan 30, 2023
ashking94 changed the title from "[BUG] Fix flaky test SegmentReplicationRelocationIT testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" to "[BUG] failing IT test : SegmentReplicationRelocationIT testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" on Jan 30, 2023
@ashking94
Member Author

Another occurrence - https://build.ci.opensearch.org/job/gradle-check/10155/


dreamer-89 self-assigned this on Jan 30, 2023
dreamer-89 changed the title from "[BUG] failing IT test : SegmentReplicationRelocationIT testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" to "[BUG] failing IT test : SegmentReplicationRelocationIT" on Jan 31, 2023
@dreamer-89
Member

dreamer-89 commented Jan 31, 2023

Other tests inside SegmentReplicationRelocationIT are also flaky; muting them all. The test does not fail locally with the provided seed irrespective of iteration count (passes 1000/1000).

(Screenshot: flaky test failure report, Jan 31, 2023)

The test on CI fails while waiting for the cluster health response due to pending operations, which suggests some load on the cluster.

  java.lang.AssertionError: shard [test-idx-1][0] on node [node_t0] has pending operations:
     --> BulkShardRequest [[test-idx-1][0]] containing [index {[test-idx-1][349], source[{"field":"value349"}]}] blocking until refresh
    	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:250)

@dreamer-89
Member

dreamer-89 commented Jan 31, 2023

Next steps:

  1. Re-run the test with the failing seed on an Ubuntu machine
  2. Run gradle check if the above doesn't fail
  3. Induce stress on the machine to see if it helps reproduce the issue, using either the stress utility or AWS FIS

@mch2
Member

mch2 commented Feb 10, 2023

@dreamer-89 I took a look at testRelocateWhileContinuouslyIndexingAndWaitingForRefresh last night and found the following:

  1. The test fails consistently by timing out on this block after ACCEPTABLE_RELOCATION_TIME elapses.

        clusterHealthResponse = client().admin()
            .cluster()
            .prepareHealth()
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNoRelocatingShards(true)
            .setTimeout(ACCEPTABLE_RELOCATION_TIME)
            .execute()
            .actionGet();
        assertEquals(clusterHealthResponse.isTimedOut(), false);

With the following trace:


java.lang.AssertionError: 
Expected :true
Actual   :false


	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:120)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh(SegmentReplicationRelocationIT.java:282)...

java.lang.AssertionError: shard [test-idx-1][0] on node [node_t0] has pending operations:
 --> BulkShardRequest [[test-idx-1][0]] containing [index {[test-idx-1][356], source[{"field":"value356"}]}] blocking until refresh
	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:250)
	at org.opensearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3535)
	at org.opensearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:1189)
	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:477)

The test uses WAIT_UNTIL for every index request, so the old primary accumulates pending operations that wait for a refresh, blocking relocation. At the same time the test disables auto-refresh in the index settings with .put("index.refresh_interval", -1), so those refreshes never happen. To fix this we can either remove that index setting, or use the IMMEDIATE or NONE RefreshPolicy in the requests.
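
For reference, a minimal sketch of that combination, assuming the usual OpenSearchIntegTestCase helpers and imports; the index name, doc id, and field value come from the traces above, while the shard/replica settings are illustrative and not the exact test code:

    // Index created with auto-refresh disabled, as in the test settings.
    createIndex(
        "test-idx-1",
        Settings.builder()
            .put("index.number_of_shards", 1)      // illustrative values, not the test's
            .put("index.number_of_replicas", 1)
            .put("index.refresh_interval", -1)     // no background refreshes
            .build()
    );

    // Each indexing request then blocks until a refresh makes the doc visible,
    // which never happens while refreshes are disabled, so the permit is held and relocation stalls.
    client().prepareIndex("test-idx-1")
        .setId("349")
        .setSource("field", "value349")
        .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL)
        .get();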

The second way this fails:

I chose to remove the index setting & refresh policy entirely. After that update, the test still fails with a timeout, but with a different exception:

WARNING: Uncaught exception in thread: Thread[#83,opensearch[node_t2][generic][T#2],5,TGRP-SegmentReplicationRelocationIT]
java.lang.AssertionError: [id=M14Vtxm4RZidkqqnP7sj9g, rId=3Bl7vE1bTRmg4e6t62VLoA]local checkpoint [510] does not match checkpoint from primary context [PrimaryContext{clusterStateVersion=11, checkpoints={M14Vtxm4RZidkqqnP7sj9g=LocalCheckpointState{localCheckpoint=511, globalCheckpoint=510, inSync=true, tracked=true, replicated=true}, G0DOw7fqSNqnJ_s2j31d1Q=LocalCheckpointState{localCheckpoint=511, globalCheckpoint=510, inSync=true, tracked=true, replicated=true}, 3Bl7vE1bTRmg4e6t62VLoA=LocalCheckpointState{localCheckpoint=511, globalCheckpoint=510, inSync=true, tracked=true, replicated=true}}, routingTable=IndexShardRoutingTable([test-idx-1][0]){[test-idx-1][0], node[BD1h1HFLSdSJW_suc2PZtg], [R], s[STARTED], a[id=G0DOw7fqSNqnJ_s2j31d1Q], [test-idx-1][0], node[85ssPJmDSu6plgVo5hoQcQ], relocating [pUcgOAgXTISji7KPtEzzVA], [P], s[RELOCATING], a[id=3Bl7vE1bTRmg4e6t62VLoA, rId=M14Vtxm4RZidkqqnP7sj9g], expected_shard_size[208]}}]
	at __randomizedtesting.SeedInfo.seed([99322C710E2AD6C7]:0)
	at org.opensearch.index.shard.IndexShard.activateWithPrimaryContext(IndexShard.java:3079)
	at org.opensearch.indices.recovery.RecoveryTarget.handoffPrimaryContext(RecoveryTarget.java:261)
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:413)
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:407)
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

This error is thrown after the new primary has done a round of segrep to sync with the old primary, flipped to a writeable engine, and recovered any missing ops from the xlog. The issue here is that the old primary ack'd 511 docs, meaning ops up to seqNo 511 are persisted in the new primary's xlog. However, the global cp is still 510, and that is what is used as the upper bound for the number of docs to recover once the new primary is flipped.

Added some more detailed logs to show this...

[2023-02-10T13:31:42,073][INFO ][o.o.i.s.ReplicationTracker] [node_t0] [test-idx-1][0] updated local checkpoint of [M14Vtxm4RZidkqqnP7sj9g] from [510] to [511]
[2023-02-10T13:31:42,102][INFO ][o.o.i.s.IndexShard       ] [node_t0] [test-idx-1][0] Current checkpoint before force sync 502 ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=3, seqNo=231, version=22}
[2023-02-10T13:31:42,227][INFO ][o.o.i.r.SegmentReplicationTargetService] [node_t2] [shardId 0] [AllocationId [id=M14Vtxm4RZidkqqnP7sj9g, rId=3Bl7vE1bTRmg4e6t62VLoA]] [replication id 11] Replication complete to checkpoint ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=4, seqNo=231, version=22}, timing data: {INIT=0, OVERALL=93, GET_CHECKPOINT_INFO=21, FINALIZE_REPLICATION=39, FILE_DIFF=1, REPLICATING=0, GET_FILES=29}
[2023-02-10T13:31:42,235][INFO ][o.o.i.r.SegmentReplicationTargetService] [node_t2] Latest checkpoint before flip 388 ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=4, seqNo=231, version=22} persisted 511 processed 231
[2023-02-10T13:31:42,294][INFO ][o.o.i.e.Engine           ] [node_t2] [test-idx-1][0] recovered maximum sequence number [511] and local checkpoint [231]
[2023-02-10T13:31:42,327][INFO ][o.o.i.t.InternalTranslogManager] [node_t2] recoverFromTranslog from 231 to 510
[2023-02-10T13:31:42,331][INFO ][o.o.i.t.InternalTranslogManager] [node_t2] Snapshot total ops 388
[2023-02-10T13:31:42,519][INFO ][o.o.i.t.InternalTranslogManager] [node_t2] Ops recovered? 279
[2023-02-10T13:31:42,540][INFO ][o.o.i.r.SegmentReplicationTargetService] [node_t2] Latest checkpoint after flip 388 ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=6, seqNo=510, version=27} persisted 510 processed 510

The first line updates the old primary's replicationTracker to 511 for the new primary's allocationId; however, once the new primary is flipped to a writeable engine, its processed cp is only 510. After the engine is flipped, the new primary's checkpoint tracker resets its processed & persisted seqNos to the local cp, in this case 510.
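
To make the off-by-one concrete, here is the arithmetic only, with the values taken from the logs above; this is illustrative, not OpenSearch code:

    long maxSeqNo = 511;          // last op ack'd by the old primary, persisted in the new primary's xlog
    long globalCheckpoint = 510;  // global cp at handoff time
    // Translog replay stops at the global checkpoint, so op 511 is never processed;
    // the new primary then reports local cp 510, which conflicts with the 511
    // recorded for it in the handed-off primary context.
    long replayUpperBound = Math.min(maxSeqNo, globalCheckpoint); // = 510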

Fix: There are a couple of options here.
A. Force a refresh on the old primary so that all buffered docs are written to segments, and those are copied to the new primary.
B. When invoking resetEngineToGlobalCheckpoint, recover up to the max seqNo instead of the global cp. This was done with #6314 for the remote xlog; we can extend it to also cover SR.

I suggest we go with option A here: in the interest of avoiding reindexing at all costs, we flush the new segments and copy them over. We could go with option B as well as a safety net, or at a minimum add an assertion that the maxSeqNo is equal to the latest seqNo in the replication cp.

The third way this fails:

Once the new primary is active and has recovered ops from the xlog, it does not publish a new checkpoint automatically, because at the time of the refresh the shard is not in a valid state. This causes the replica to not sync and to be stale in the assertion. Logs below; node_t2 is the new primary, node_t0 is the old.

[2023-02-10T14:42:38,145][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t2] Skipping checkpoint publish didrefresh: false - shard state: RECOVERING - primaryMode: true
[2023-02-10T14:42:38,235][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t0] Skipping checkpoint publish didrefresh: true - shard state: CLOSED - primaryMode: false
[2023-02-10T14:42:38,320][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t2] Skipping checkpoint publish didrefresh: false - shard state: STARTED - primaryMode: true
[2023-02-10T14:42:38,325][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t2] Skipping checkpoint publish didrefresh: false - shard state: STARTED - primaryMode: true
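
The log lines above come from a guard of roughly this shape; this is a hedged sketch only, and the method and field names (publisher, shard, the exact state check) are illustrative rather than the actual CheckpointRefreshListener code:

    @Override
    public void afterRefresh(boolean didRefresh) {
        // Publish only when the refresh produced something new, the shard is started,
        // and we are in primary mode; otherwise skip and log, as seen above.
        if (didRefresh
            && shard.state() == IndexShardState.STARTED
            && shard.getReplicationTracker().isPrimaryMode()) {
            publisher.publish(shard); // notify replicas of the new checkpoint
        } else {
            logger.info("Skipping checkpoint publish didrefresh: {} - shard state: {} - primaryMode: {}",
                didRefresh, shard.state(), shard.getReplicationTracker().isPrimaryMode());
        }
    }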

Fix options:
A. Force a flush on the new primary after it has finished recovering. This creates another bug in that the primary from which replicas sync is currently fetched from cluster state, which will return the old primary until it is updated. This will need to be handled with failures/retries or by syncing only to the node that published the cp.
B. Ignore this; the replica will catch up on the next doc insert, so in the test just insert a new doc.

IMO we need to sync the replica with option A rather than depending on some future event to occur.

@mch2
Member

mch2 commented Feb 14, 2023

Ok so after making all of these fixes, we still have a problem.

During relocation we block operations on the primary, which waits until all outstanding permits are released. If one of those permits is held by a WAIT_UNTIL request, the relocation hangs indefinitely. This happens because wait_until does not complete until replicas have refreshed past a particular translog location, and replicas with SR do not refresh at all until they receive a new set of segments; at this point the primary is not refreshing, and it cannot forcefully publish a checkpoint because publishCheckpoint requires an op permit on the primary. The easy way to fix this is to implement the maybeRefresh method in NRTEngine so it at least triggers its RefreshListeners; however, this breaks the meaning of wait_until, because it would only be waiting until the ops are durable, not visible.

I think the only way around this currently is to flip to a polling architecture instead of pushing/publishing checkpoints to replicas. I was able to get this going fairly quickly and these tests pass 100% of the time for me now... will put up a draft asap...

@mch2
Member

mch2 commented Feb 14, 2023

@ashking94 FYI - #6315

@mch2
Member

mch2 commented Feb 19, 2023

While polling fixes the WAIT_UNTIL issue above, it is a somewhat larger change with its own considerations, namely how to reduce traffic to primaries that are not actively indexing. I think polling could be the right solution going forward, but to allow us time to vet it thoroughly, a simpler change is in #6366. That change simply avoids attempting to acquire a lock on the primary by not sending a publishCheckpoint request to itself, allowing replicas to receive checkpoints and proceed with running a round of replication.
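
For illustration, the shape of that change looks roughly like the following; this is a sketch under assumed names (replicationGroup, primaryRouting, sendPublishCheckpoint), not the actual #6366 diff:

    // When fanning out a new checkpoint, skip the primary's own allocation id so the
    // primary never tries to acquire an operation permit on itself.
    for (ShardRouting target : replicationGroup.getReplicationTargets()) {
        if (target.allocationId().getId().equals(primaryRouting.allocationId().getId())) {
            continue; // do not send publishCheckpoint to ourselves
        }
        sendPublishCheckpoint(target, checkpoint); // assumed helper wrapping the transport call
    }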

To summarize the above issues:
1 - WAIT_UNTIL & (-1) refresh interval - fixed with #6366. The relocation would not proceed because refreshes were never triggered, blocking relocation.
2 - Fixed with #6352.
3 - WAIT_UNTIL blocking relocation - fixed with #6366; SR rounds are able to complete to replicas, completing WAIT_UNTIL requests.
4 - Replicas not syncing to the new primary after relocation completes - I have not seen this as an issue after #6366, because the new primary is rerouted the delayed WAIT_UNTIL requests, which are then indexed and a new checkpoint is published.

We will still need to cover edge cases where publish requests fail. I think a good solution here is actually to use polling on the replica, but only trigger requests if we recognize the shard is behind (we will know this based on the xlog noOp write) and we are not actively copying segments. We should also add retries + jitter to the publish request.
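
On the retries + jitter point, a minimal sketch of exponential backoff with full jitter around the publish call, assuming java.util.concurrent.ThreadLocalRandom; publishCheckpoint, maxAttempts, and the delays are assumptions for illustration, not anything from #6366:

    void publishWithRetry(ReplicationCheckpoint checkpoint) throws Exception {
        final int maxAttempts = 5;
        final long baseDelayMs = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                publishCheckpoint(checkpoint); // assumed call that may fail transiently
                return;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e; // out of retries, surface the failure
                }
                long cap = baseDelayMs * (1L << (attempt - 1));               // exponential backoff cap
                long sleepMs = ThreadLocalRandom.current().nextLong(cap + 1); // full jitter in [0, cap]
                Thread.sleep(sleepMs);
            }
        }
    }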
