
[BUG] failing IT test : SegmentReplicationRelocationIT #6065

Closed
ashking94 opened this issue Jan 30, 2023 · 8 comments · Fixed by #6366
Labels: bug (Something isn't working), distributed framework

@ashking94
Member

Describe the bug
https://build.ci.opensearch.org/job/gradle-check/10147/


ashking94 added the bug (Something isn't working) and untriaged labels on Jan 30, 2023
ashking94 changed the title from "[BUG] Fix flaky test SegmentReplicationRelocationIT testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" to "[BUG] failing IT test : SegmentReplicationRelocationIT testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" on Jan 30, 2023
@ashking94
Member Author

Another occurrence - https://build.ci.opensearch.org/job/gradle-check/10155/


dreamer-89 self-assigned this on Jan 30, 2023
dreamer-89 changed the title from "[BUG] failing IT test : SegmentReplicationRelocationIT testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" to "[BUG] failing IT test : SegmentReplicationRelocationIT" on Jan 31, 2023
@dreamer-89
Member

dreamer-89 commented Jan 31, 2023

Other tests inside SegmentReplicationRelocationIT are also flaky; muting them all. The test does not fail locally with the provided seed irrespective of iteration count (passes 1000/1000).

(Screenshot: flaky test failure report, Jan 31, 2023)

The test on CI fails while waiting for the cluster health response due to pending operations, which suggests some load on the cluster.

  java.lang.AssertionError: shard [test-idx-1][0] on node [node_t0] has pending operations:
     --> BulkShardRequest [[test-idx-1][0]] containing [index {[test-idx-1][349], source[{"field":"value349"}]}] blocking until refresh
    	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:250)

@dreamer-89
Member

dreamer-89 commented Jan 31, 2023

Next steps:

  1. Re-run the test with the failing seed on an Ubuntu machine
  2. Run gradle check if the above doesn't fail
  3. Induce stress on the machine to see if it helps reproduce the issue, using either the stress utility or AWS FIS

@mch2
Member

mch2 commented Feb 10, 2023

@dreamer-89 I took a look at testRelocateWhileContinuouslyIndexingAndWaitingForRefresh last night and found the following:

  1. The test fails consistently by timing out on this block after ACCEPTABLE_RELOCATION_TIME elapses.

        clusterHealthResponse = client().admin()
            .cluster()
            .prepareHealth()
            .setWaitForEvents(Priority.LANGUID)
            .setWaitForNoRelocatingShards(true)
            .setTimeout(ACCEPTABLE_RELOCATION_TIME)
            .execute()
            .actionGet();
        assertEquals(clusterHealthResponse.isTimedOut(), false);

With the following trace:


java.lang.AssertionError: 
Expected :true
Actual   :false


	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:120)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh(SegmentReplicationRelocationIT.java:282)...

java.lang.AssertionError: shard [test-idx-1][0] on node [node_t0] has pending operations:
 --> BulkShardRequest [[test-idx-1][0]] containing [index {[test-idx-1][356], source[{"field":"value356"}]}] blocking until refresh
	at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:250)
	at org.opensearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:3535)
	at org.opensearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:1189)
	at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:477)

The test uses WAIT_UNTIL for every index request, so the old primary accumulates pending operations that wait for a refresh, blocking relocation. At the same time the test disables auto-refresh in the index settings with .put("index.refresh_interval", -1), so those refreshes never happen. To fix this we can either remove that index setting, or use the IMMEDIATE or NONE RefreshPolicy in the requests.
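
For reference, a minimal sketch of that combination, assuming the usual OpenSearchIntegTestCase helpers and imports; the index name, doc id, and field value come from the traces above, while the shard/replica settings are illustrative and not the exact test code:

    // Index created with auto-refresh disabled, as in the test settings.
    createIndex(
        "test-idx-1",
        Settings.builder()
            .put("index.number_of_shards", 1)      // illustrative values, not the test's
            .put("index.number_of_replicas", 1)
            .put("index.refresh_interval", -1)     // no background refreshes
            .build()
    );

    // Each indexing request then blocks until a refresh makes the doc visible,
    // which never happens while refreshes are disabled, so the permit is held and relocation stalls.
    client().prepareIndex("test-idx-1")
        .setId("349")
        .setSource("field", "value349")
        .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL)
        .get();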

The second way this fails:

I chose to remove the index setting & refresh policy entirely. After that update, the test still fails with a timeout, but with a different exception:

WARNING: Uncaught exception in thread: Thread[#83,opensearch[node_t2][generic][T#2],5,TGRP-SegmentReplicationRelocationIT]
java.lang.AssertionError: [id=M14Vtxm4RZidkqqnP7sj9g, rId=3Bl7vE1bTRmg4e6t62VLoA]local checkpoint [510] does not match checkpoint from primary context [PrimaryContext{clusterStateVersion=11, checkpoints={M14Vtxm4RZidkqqnP7sj9g=LocalCheckpointState{localCheckpoint=511, globalCheckpoint=510, inSync=true, tracked=true, replicated=true}, G0DOw7fqSNqnJ_s2j31d1Q=LocalCheckpointState{localCheckpoint=511, globalCheckpoint=510, inSync=true, tracked=true, replicated=true}, 3Bl7vE1bTRmg4e6t62VLoA=LocalCheckpointState{localCheckpoint=511, globalCheckpoint=510, inSync=true, tracked=true, replicated=true}}, routingTable=IndexShardRoutingTable([test-idx-1][0]){[test-idx-1][0], node[BD1h1HFLSdSJW_suc2PZtg], [R], s[STARTED], a[id=G0DOw7fqSNqnJ_s2j31d1Q], [test-idx-1][0], node[85ssPJmDSu6plgVo5hoQcQ], relocating [pUcgOAgXTISji7KPtEzzVA], [P], s[RELOCATING], a[id=3Bl7vE1bTRmg4e6t62VLoA, rId=M14Vtxm4RZidkqqnP7sj9g], expected_shard_size[208]}}]
	at __randomizedtesting.SeedInfo.seed([99322C710E2AD6C7]:0)
	at org.opensearch.index.shard.IndexShard.activateWithPrimaryContext(IndexShard.java:3079)
	at org.opensearch.indices.recovery.RecoveryTarget.handoffPrimaryContext(RecoveryTarget.java:261)
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:413)
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$HandoffPrimaryContextRequestHandler.messageReceived(PeerRecoveryTargetService.java:407)
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

This error is thrown after the new primary has done a round of segrep to sync with the old primary, flipped to a writeable engine, and recovered any missing ops from the xlog. The issue here is that the old primary ack'd 511 docs, meaning ops up to seqNo 511 are persisted in the new primary's xlog. However, the global cp is still 510, and that is what is used as the upper bound for the number of docs to recover once the new primary is flipped.

Added some more detailed logs to show this...

[2023-02-10T13:31:42,073][INFO ][o.o.i.s.ReplicationTracker] [node_t0] [test-idx-1][0] updated local checkpoint of [M14Vtxm4RZidkqqnP7sj9g] from [510] to [511]
[2023-02-10T13:31:42,102][INFO ][o.o.i.s.IndexShard       ] [node_t0] [test-idx-1][0] Current checkpoint before force sync 502 ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=3, seqNo=231, version=22}
[2023-02-10T13:31:42,227][INFO ][o.o.i.r.SegmentReplicationTargetService] [node_t2] [shardId 0] [AllocationId [id=M14Vtxm4RZidkqqnP7sj9g, rId=3Bl7vE1bTRmg4e6t62VLoA]] [replication id 11] Replication complete to checkpoint ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=4, seqNo=231, version=22}, timing data: {INIT=0, OVERALL=93, GET_CHECKPOINT_INFO=21, FINALIZE_REPLICATION=39, FILE_DIFF=1, REPLICATING=0, GET_FILES=29}
[2023-02-10T13:31:42,235][INFO ][o.o.i.r.SegmentReplicationTargetService] [node_t2] Latest checkpoint before flip 388 ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=4, seqNo=231, version=22} persisted 511 processed 231
[2023-02-10T13:31:42,294][INFO ][o.o.i.e.Engine           ] [node_t2] [test-idx-1][0] recovered maximum sequence number [511] and local checkpoint [231]
[2023-02-10T13:31:42,327][INFO ][o.o.i.t.InternalTranslogManager] [node_t2] recoverFromTranslog from 231 to 510
[2023-02-10T13:31:42,331][INFO ][o.o.i.t.InternalTranslogManager] [node_t2] Snapshot total ops 388
[2023-02-10T13:31:42,519][INFO ][o.o.i.t.InternalTranslogManager] [node_t2] Ops recovered? 279
[2023-02-10T13:31:42,540][INFO ][o.o.i.r.SegmentReplicationTargetService] [node_t2] Latest checkpoint after flip 388 ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=1, segmentsGen=6, seqNo=510, version=27} persisted 510 processed 510

The first line updates the old primary's replicationTracker to 511 for the new primary's allocationId; however, once the new primary is flipped to a writeable engine, its processed cp is only 510. After the engine is flipped, the new primary's checkpoint tracker resets its processed & persisted seqNos to the local cp, in this case 510.
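
To make the off-by-one concrete, here is the arithmetic only, with the values taken from the logs above; this is illustrative, not OpenSearch code:

    long maxSeqNo = 511;          // last op ack'd by the old primary, persisted in the new primary's xlog
    long globalCheckpoint = 510;  // global cp at handoff time
    // Translog replay stops at the global checkpoint, so op 511 is never processed;
    // the new primary then reports local cp 510, which conflicts with the 511
    // recorded for it in the handed-off primary context.
    long replayUpperBound = Math.min(maxSeqNo, globalCheckpoint); // = 510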

Fix: There are a couple of options here.
A. Force a refresh on the old primary so that all buffered docs are written to segments, and those are copied to the new primary.
B. When invoking resetEngineToGlobalCheckpoint, recover up to the max seqNo instead of the global cp. This was done with #6314 for the remote xlog; we can extend it to also cover SR.

I suggest we go with option A here: in the interest of avoiding reindexing at all costs, we flush the new segments and copy them over. We could go with option B as well as a safety net, or at a minimum add an assertion that the maxSeqNo is equal to the latest seqNo in the replication cp.

The third way this fails:

Once the new primary is active and has recovered ops from the xlog, it does not publish a new checkpoint automatically, because at the time of the refresh the shard is not in a valid state. This causes the replica to not sync and to be stale in the assertion. Logs below; node_t2 is the new primary, node_t0 is the old.

[2023-02-10T14:42:38,145][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t2] Skipping checkpoint publish didrefresh: false - shard state: RECOVERING - primaryMode: true
[2023-02-10T14:42:38,235][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t0] Skipping checkpoint publish didrefresh: true - shard state: CLOSED - primaryMode: false
[2023-02-10T14:42:38,320][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t2] Skipping checkpoint publish didrefresh: false - shard state: STARTED - primaryMode: true
[2023-02-10T14:42:38,325][INFO ][o.o.i.s.CheckpointRefreshListener] [node_t2] Skipping checkpoint publish didrefresh: false - shard state: STARTED - primaryMode: true
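
The log lines above come from a guard of roughly this shape; this is a hedged sketch only, and the method and field names (publisher, shard, the exact state check) are illustrative rather than the actual CheckpointRefreshListener code:

    @Override
    public void afterRefresh(boolean didRefresh) {
        // Publish only when the refresh produced something new, the shard is started,
        // and we are in primary mode; otherwise skip and log, as seen above.
        if (didRefresh
            && shard.state() == IndexShardState.STARTED
            && shard.getReplicationTracker().isPrimaryMode()) {
            publisher.publish(shard); // notify replicas of the new checkpoint
        } else {
            logger.info("Skipping checkpoint publish didrefresh: {} - shard state: {} - primaryMode: {}",
                didRefresh, shard.state(), shard.getReplicationTracker().isPrimaryMode());
        }
    }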

Fix options:
A. Force a flush on the new primary after it has finished recovering. This creates another bug in that the primary from which replicas sync is currently fetched from cluster state, which will return the old primary until it is updated. This will need to be handled with failures/retries or by syncing only to the node that published the cp.
B. Ignore this; the replica will catch up on the next doc insert, so in the test just insert a new doc.

IMO we need to sync the replica with option A rather than depending on some future event to occur.

@mch2
Member

mch2 commented Feb 14, 2023

Ok so after making all of these fixes, we still have a problem.

During relocation we block operations on the primary, which waits until all outstanding permits are released. If one of those permits is held by a WAIT_UNTIL request, the relocation hangs indefinitely. This happens because wait_until does not complete until replicas have refreshed past a particular translog location, and replicas with SR do not refresh at all until they receive a new set of segments; at this point the primary is not refreshing, and it cannot forcefully publish a checkpoint because publishCheckpoint requires an op permit on the primary. The easy way to fix this is to implement the maybeRefresh method in NRTEngine so it at least triggers its RefreshListeners; however, this breaks the meaning of wait_until, because it would only be waiting until the ops are durable, not visible.

I think the only way around this currently is to flip to a polling architecture instead of pushing/publishing checkpoints to replicas. I was able to get this going fairly quickly and these tests pass 100% of the time for me now... will put up a draft asap...

@mch2
Member

mch2 commented Feb 14, 2023

@ashking94 FYI - #6315

@mch2
Member

mch2 commented Feb 19, 2023

While polling fixes the WAIT_UNTIL issue above, it is a somewhat larger change with its own considerations, namely how to reduce traffic to primaries that are not actively indexing. I think polling could be the right solution going forward, but to allow us time to vet it thoroughly, a simpler change is in #6366. That change simply avoids attempting to acquire a lock on the primary by not sending a publishCheckpoint request to itself, allowing replicas to receive checkpoints and proceed with running a round of replication.
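
For illustration, the shape of that change looks roughly like the following; this is a sketch under assumed names (replicationGroup, primaryRouting, sendPublishCheckpoint), not the actual #6366 diff:

    // When fanning out a new checkpoint, skip the primary's own allocation id so the
    // primary never tries to acquire an operation permit on itself.
    for (ShardRouting target : replicationGroup.getReplicationTargets()) {
        if (target.allocationId().getId().equals(primaryRouting.allocationId().getId())) {
            continue; // do not send publishCheckpoint to ourselves
        }
        sendPublishCheckpoint(target, checkpoint); // assumed helper wrapping the transport call
    }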

To summarize the above issues:
1 - WAIT_UNTIL & (-1) refresh interval - fixed with #6366. The relocation would not proceed because refreshes were never triggered, blocking relocation.
2 - Fixed with #6352.
3 - WAIT_UNTIL blocking relocation - fixed with #6366; SR rounds are able to complete to replicas, completing WAIT_UNTIL requests.
4 - Replicas not syncing to the new primary after relocation completes - I have not seen this as an issue after #6366, because the new primary is rerouted the delayed WAIT_UNTIL requests, which are then indexed and a new checkpoint is published.

We will still need to cover edge cases where publish requests fail. I think a good solution here is actually to use polling on the replica, but only trigger requests if we recognize the shard is behind (we will know this based on the xlog noOp write) and we are not actively copying segments. We should also add retries + jitter to the publish request.
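
On the retries + jitter point, a minimal sketch of exponential backoff with full jitter around the publish call, assuming java.util.concurrent.ThreadLocalRandom; publishCheckpoint, maxAttempts, and the delays are assumptions for illustration, not anything from #6366:

    void publishWithRetry(ReplicationCheckpoint checkpoint) throws Exception {
        final int maxAttempts = 5;
        final long baseDelayMs = 100;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                publishCheckpoint(checkpoint); // assumed call that may fail transiently
                return;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e; // out of retries, surface the failure
                }
                long cap = baseDelayMs * (1L << (attempt - 1));               // exponential backoff cap
                long sleepMs = ThreadLocalRandom.current().nextLong(cap + 1); // full jitter in [0, cap]
                Thread.sleep(sleepMs);
            }
        }
    }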
