[Backport 2.x] Fix flakiness with SegmentReplicationSuiteIT #13180

Merged
mch2 merged 1 commit into 2.x from backport/backport-11977-to-2.x on Apr 16, 2024

opensearch-trigger-bot[bot]

Backport e828c18 from #11977.

* Fix SegmentReplicationSuiteIT

This test fails because of a race during shard/node shutdown with node-node replication.
Fixed by properly synchronizing creation of new replication events with cancellation and cancelling
after shards are closed.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Remove CopyState caching from OngoingSegmentReplications.

This change removes the responsibility of caching CopyState inside of OngoingSegmentReplications.
1. CopyState was originally cached to prevent frequent disk reads while building segment metadata.  This is now
cached lower down in IndexShard and is not required here.
2. Change prepareForReplication method to return SegmentReplicationSourceHandler directly
3. Move responsibility of creating and clearing CopyState to the handler.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix comment for afterIndexShardClosed method.

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Fix comment on beforeIndexShardClosed

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

* Remove unnecessary method from OngoingSegmentReplications

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>

---------

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
(cherry picked from commit e828c18)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
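
The first commit message describes synchronizing the creation of new replication events with cancellation. A minimal sketch of that idea follows. It is not the OpenSearch implementation: `ReplicationRegistry`, `Handler`, `prepare`, and `cancelForShard` are hypothetical names, and tracking closed shards in a set is a simplifying assumption made only for illustration. The point it shows is that registration and cancellation take the same lock, so a handler cannot be registered for a shard between the moment the shard is marked closed and the moment its handlers are cancelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical registry of in-flight replication handlers (illustration only).
class ReplicationRegistry {

    // Hypothetical per-replication handler; only tracks whether it was cancelled.
    static final class Handler {
        final String shardId;
        final String allocationId;
        private volatile boolean cancelled = false;

        Handler(String shardId, String allocationId) {
            this.shardId = shardId;
            this.allocationId = allocationId;
        }

        void cancel() {
            cancelled = true;
        }

        boolean isCancelled() {
            return cancelled;
        }
    }

    private final Object lock = new Object();
    private final Map<String, Handler> handlers = new HashMap<>(); // keyed by allocation id
    private final Set<String> closedShards = new HashSet<>();      // assumption: closed shards are remembered here

    // Registers a handler unless the shard has already been closed.
    Handler prepare(String shardId, String allocationId) {
        synchronized (lock) {
            if (closedShards.contains(shardId)) {
                throw new IllegalStateException("shard is closed: " + shardId);
            }
            if (handlers.containsKey(allocationId)) {
                throw new IllegalStateException("replication already in progress: " + allocationId);
            }
            Handler handler = new Handler(shardId, allocationId);
            handlers.put(allocationId, handler);
            return handler;
        }
    }

    // Called after the shard is closed: marks it closed and removes its handlers
    // under the same lock used by prepare(), then cancels them outside the lock.
    int cancelForShard(String shardId) {
        List<Handler> toCancel = new ArrayList<>();
        synchronized (lock) {
            closedShards.add(shardId);
            handlers.values().removeIf(handler -> {
                if (handler.shardId.equals(shardId)) {
                    toCancel.add(handler);
                    return true;
                }
                return false;
            });
        }
        toCancel.forEach(Handler::cancel);
        return toCancel.size();
    }
}
```

In this sketch, a `prepare` call that races with `cancelForShard` either completes first, in which case its handler is found and cancelled, or runs second and is rejected because the shard is already marked closed.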

Compatibility status:

Checks if related components are compatible with change 8432375

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/performance-analyzer.git]

❌ Gradle check result for 8432375: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 8432375: ABORTED

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 8432375: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

mch2 commented Apr 16, 2024

❌ Gradle check result for 8432375: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

2> REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:v1.3.16#upgradedClusterTest' --tests "org.opensearch.upgrades.RefreshVersionInClusterStateIT.testRefresh" -Dtests.seed=85D53F34E966099E -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=th -Dtests.timezone=Europe/Malta -Druntime.java=21
  2> org.opensearch.client.ResponseException: method [GET], host [http://[::1/]:36603], URI [/_cat/nodes?h=id,version], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -2b"}],"type":"illegal_argument_exception","reason":"Values less than -1 bytes are not supported: -2b"},"status":400}
        at __randomizedtesting.SeedInfo.seed([85D53F34E966099E:7F164C95CF8F7939]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.upgrades.RefreshVersionInClusterStateIT.testRefresh(RefreshVersionInClusterStateIT.java:29)
  2> NOTE: leaving temporary files on disk at: /var/jenkins/workspace/gradle-check/search/qa/rolling-upgrade/build/testrun/v1.3.16#upgradedClusterTest/temp/org.opensearch.upgrades.RefreshVersionInClusterStateIT_85D53F34E966099E-001
  2> NOTE: test params are: codec=Asserting(Lucene99): {}, docValues:{}, maxPointsInLeafNode=110, maxMBSortInHeap=7.182415897638013, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=th, timezone=Europe/Malta
  2> NOTE: Linux 5.15.0-1056-aws amd64/Eclipse Adoptium 21.0.2 (64-bit)/cpus=32,threads=1,free=419361992,total=536870912
  2> NOTE: All tests run in this JVM: [IndexingIT, JodaCompatibilityIT, MappingIT, MappingTypeRemovalIT, RecoveryIT, RefreshVersionInClusterStateIT]

This test runs BWC against 1.3.x. SegRep tests don't run against 1.x, so I will tag this as flaky and re-run.

❌ Gradle check result for 8432375: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❕ Gradle check result for 8432375: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.ClusterRerouteIT.testDelayWithALargeAmountOfShards

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

mch2 commented Apr 16, 2024

#13234

@mch2 mch2 merged commit 59e4eca into 2.x Apr 16, 2024
54 of 55 checks passed
@mch2 mch2 deleted the backport/backport-11977-to-2.x branch April 16, 2024 16:14