
[Segment Replication] Fix flaky tests testSegmentReplicationStatsResponse() and testSegmentReplicationStatsWithTimeout() #6268

Merged
merged 10 commits into from
Feb 21, 2023

Conversation

Rishikesh1159
Member

@Rishikesh1159 Rishikesh1159 commented Feb 9, 2023

Signed-off-by: Rishikesh1159 rishireddy1159@gmail.com

Description

This PR:
- Fixes the flaky tests testSegmentReplicationStatsWithTimeout() and testSegmentReplicationStatsResponse().
- Makes the timingData map a concurrent map to handle concurrent modifications.
- Moves the timing data under the detailed flag.
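The concurrent-map change can be sketched as below. This is a minimal, self-contained sketch: the class name, the `Stage` values, and the helper methods are simplified stand-ins for illustration, not the actual `SegmentReplicationState` code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the timing-data change: a ConcurrentHashMap tolerates writes
// from the replication thread while a stats request reads the map.
public class TimingDataSketch {
    enum Stage { INIT, GET_CHECKPOINT_INFO, FILE_DIFF, GET_FILES, FINALIZE, DONE }

    // Sized for one entry per stage, plus one for the overall total.
    private final Map<String, Long> timingData = new ConcurrentHashMap<>(Stage.values().length + 1);

    void recordStageTime(Stage stage, long millis) {
        timingData.put(stage.name(), millis);
    }

    long totalMillis() {
        return timingData.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        TimingDataSketch s = new TimingDataSketch();
        s.recordStageTime(Stage.INIT, 5);
        s.recordStageTime(Stage.DONE, 7);
        System.out.println(s.totalMillis()); // 12
    }
}
```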

Issues Resolved

Resolves #6261, #6255

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

github-actions bot commented Feb 9, 2023

Gradle Check (Jenkins) Run Completed with:

private final Map<String, Long> timingData;

// Timing data will have as many entries as stages, plus one
private final Map<String, Long> timingData = new ConcurrentHashMap<>(Stage.values().length + 1);
Member

I think we still want to init this in constructors only and send this map over the wire; otherwise the map will be empty on the other end when querying stats. You can make the map concurrent in the ctor, or copy it before writing.

Member Author

I am going with copying the map and sending the copy in the output stream.
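The copy-before-write approach can be sketched as below. Since the real `writeTo(StreamOutput)` signature is not shown in this thread, `snapshotForWire()` is a hypothetical stand-in that simply returns the copy that would be serialized.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: snapshot the concurrent map into a plain HashMap before
// serialization, so the writer iterates a stable copy while the
// replication thread keeps mutating the live map.
public class CopyOnWriteSketch {
    private final Map<String, Long> timingData = new ConcurrentHashMap<>();

    void put(String key, long value) {
        timingData.put(key, value);
    }

    // Stand-in for writeTo(StreamOutput): return the snapshot we'd serialize.
    Map<String, Long> snapshotForWire() {
        return new HashMap<>(timingData);
    }

    public static void main(String[] args) {
        CopyOnWriteSketch s = new CopyOnWriteSketch();
        s.put("INIT", 3L);
        Map<String, Long> wire = s.snapshotForWire();
        s.put("DONE", 4L); // later mutation does not affect the snapshot
        System.out.println(wire.size()); // 1
    }
}
```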

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationStatsIT.testSegmentReplicationStatsResponse

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWithAReadOnlyBlock

@codecov-commenter

codecov-commenter commented Feb 10, 2023

Codecov Report

Merging #6268 (ba41b87) into main (1904158) will decrease coverage by 0.29%.
The diff coverage is 2.94%.


@@             Coverage Diff              @@
##               main    #6268      +/-   ##
============================================
- Coverage     70.82%   70.53%   -0.29%     
+ Complexity    59078    58782     -296     
============================================
  Files          4799     4799              
  Lines        282434   282441       +7     
  Branches      40717    40716       -1     
============================================
- Hits         200041   199232     -809     
- Misses        65977    66780     +803     
- Partials      16416    16429      +13     
Impacted Files Coverage Δ
...cation/TransportSegmentReplicationStatsAction.java 11.11% <0.00%> (ø)
...st/action/cat/RestCatSegmentReplicationAction.java 48.97% <0.00%> (-10.21%) ⬇️
...h/indices/replication/SegmentReplicationState.java 44.60% <4.16%> (-2.37%) ⬇️
...search/aggregations/pipeline/ParsedDerivative.java 0.00% <0.00%> (-79.17%) ⬇️
...main/java/org/opensearch/common/network/Cidrs.java 0.00% <0.00%> (-77.78%) ⬇️
...in/java/org/opensearch/test/MockKeywordPlugin.java 0.00% <0.00%> (-75.00%) ⬇️
...rch/search/query/QueryPhaseExecutionException.java 0.00% <0.00%> (-66.67%) ⬇️
...n/admin/cluster/node/tasks/get/GetTaskRequest.java 30.30% <0.00%> (-63.64%) ⬇️
...a/org/opensearch/client/cluster/ProxyModeInfo.java 0.00% <0.00%> (-60.00%) ⬇️
...rg/apache/lucene/queries/SpanMatchNoDocsQuery.java 42.85% <0.00%> (-57.15%) ⬇️
... and 524 more


Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

// index 10 docs
for (int i = 0; i < 10; i++) {
client().prepareIndex(INDEX_NAME).setId(Integer.toString(i)).setSource("field", "value" + i).execute().actionGet();
public void testSegmentReplicationStatsResponse() {
Member

@dreamer-89 dreamer-89 Feb 20, 2023


@Rishikesh1159: I also fixed this flaky test as part of #6370. Could you please rebase your changes against main?

Member Author

Sure, let me rebase.

Member

@dreamer-89 dreamer-89 Feb 21, 2023


FYI, I see one run where this is still flaky: #6366 (comment).
Previously, I root-caused the problem to a race condition and fixed it by asserting that the replication state reaches DONE, which took the test from failing 20% of the time to 0%. The assertBusy block waits 10s by default; I'm not sure whether segment replication needs more time to complete. Maybe we can increase the assertBusy timeout to 60s and check whether this test is still flaky.

Member Author

Yes, testSegmentReplicationStatsResponse() was flaky previously because the stage did not reach DONE within a certain time. But now I have changed the test completely: since we are only testing the API response, and we care about the response rather than the segment replication stage, the test no longer asserts on the segrep stage that was causing the flakiness.

Member Author

@Rishikesh1159 Rishikesh1159 Feb 21, 2023


With the changes in testSegmentReplicationStatsResponse(), we only assert whether the API call returns a success or failure response. We are not concerned with whether the segment replication event has finished processing, as the purpose of the API is only to return a response irrespective of the state of the segrep event.

Member

I think there is still value in asserting the replication state, as it can uncover bugs where replication never completes for various reasons, and it is also part of the API response. I will leave it for you to decide.

Member Author

Sure, I can add asserts on the replication state. It doesn't break anything, and as you said, there is a chance it can uncover some bugs. I have added that in the latest commit. Thanks @dreamer-89

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -121,6 +131,10 @@ public void testSegmentReplicationStatsResponseForActiveAndCompletedOnly() throw
completedOnlyResponse.shardSegmentReplicationStates().get(INDEX_NAME).get(0).getStage(),
SegmentReplicationState.Stage.DONE
);
assertThat(
Member

We can use assertEquals instead of assertThat here. This applies to the other usages of assertThat as well.

Member Author

Sure, I will update all usages.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@@ -140,7 +140,7 @@ protected SegmentReplicationState shardOperation(SegmentReplicationStatsRequest
singleIndexWithSegmentReplicationDisabled = shardRouting.getIndexName();
return null;
}
if (indexShard.indexSettings().isSegRepEnabled() == false) {
if (indexShard.indexSettings().isSegRepEnabled() == false || shardRouting.primary()) {
Member

Why is this change needed?

Member Author

If the shard is a primary, we don't even need to make a call to get the segment replication state, as segment replication only happens on replica shards. Without this primary check, the code would fetch the segment replication state, check some collections, and eventually return null anyway on a primary shard. With this added condition, we just return null early if it is a primary shard.
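The guard described above can be reduced to a simplified sketch, with boolean stand-ins for `indexSettings().isSegRepEnabled()` and `shardRouting.primary()` and a placeholder string in place of the real state lookup:

```java
// Sketch of the guard: segment replication state only exists on replicas,
// so primaries (and shards of indices without segrep) return null without
// doing any state lookup at all.
public class ShardGuardSketch {
    static String segmentReplicationState(boolean segRepEnabled, boolean primary) {
        if (segRepEnabled == false || primary) {
            return null; // early return; skip the lookup entirely
        }
        return "REPLICA_STATE"; // placeholder for the real lookup
    }

    public static void main(String[] args) {
        System.out.println(segmentReplicationState(true, true));   // null (primary)
        System.out.println(segmentReplicationState(false, false)); // null (segrep disabled)
        System.out.println(segmentReplicationState(true, false));  // REPLICA_STATE
    }
}
```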

}
refresh(INDEX_NAME);
waitForSearchableDocs(10L, asList(primaryNode, replicaNode));
ensureSearchable(INDEX_NAME);

assertBusy(() -> {
Member

As I mentioned here, I've seen one instance where this test failed. Maybe we can increase the assertBusy timeout and see if it helps.

Member Author

@Rishikesh1159 Rishikesh1159 Feb 21, 2023


Ah, sure. Maybe I can increase it to 60s just to be certain that replication has completed, as you said before.
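The idea of widening the wait can be sketched as below. `assertBusy` here is a minimal self-contained re-implementation of the test-framework helper, shown only to illustrate passing a longer timeout; it is not the actual OpenSearch implementation.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stand-in for the test framework's assertBusy: keep retrying the
// assertion until it passes or the (now longer) timeout expires.
public class AssertBusySketch {
    static void assertBusy(Runnable assertion, long maxWait, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        while (true) {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(10); // back off briefly before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // Passes on the third try, well within the 60s budget.
        assertBusy(() -> {
            if (attempts.incrementAndGet() < 3) throw new AssertionError("not done yet");
        }, 60, TimeUnit.SECONDS);
        System.out.println(attempts.get()); // 3
    }
}
```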

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes

@Rishikesh1159
Member Author

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.cluster.service.MasterServiceTests.classMethod
      1 org.opensearch.cluster.service.MasterServiceTests.testThrottlingForMultipleTaskTypes

An issue is already open for these flaky tests: #5958


Successfully merging this pull request may close these issues.

[BUG] Flaky test SegmentReplicationStatsIT.testSegmentReplicationStatsResponse
4 participants