[Segment Replication] Add background checkpoint sync making SegRep more resilient to network failure #10652

Open
mch2 opened this issue Oct 16, 2023 · 0 comments
Labels: enhancement (Enhancement or improvement to existing feature or request), Indexing:Replication (Issues and PRs related to core replication framework eg segrep)

mch2 commented Oct 16, 2023

Is your feature request related to a problem? Please describe.
Today, segment replication (SegRep) relies on a transport-layer call from each primary shard to its replicas, alerting them that there are new segments to sync. When a replica finishes a sync, it reports completion back to the primary shard. This lets primaries track the state of all of their replicas and enforce SegRep (SR) backpressure if replicas fall too far behind.

These transport-layer calls are made using a RetryableTransportClient, but if one were to fail outright, the replica would not fully sync until a future write and a primary refresh occurred. Without them, the replica may never catch up, or even know that it needs to sync.

SR backpressure was implemented to help mitigate lagging replicas by blocking writes, giving replicas time to catch up. In this scenario, however, the replicas would never catch up, and writes to the affected shard could be blocked indefinitely until a flush is triggered.

Describe the solution you'd like
As a safety mechanism, add a background sync, similar to the RetentionLease sync, in which each primary sends its latest replication checkpoint to its known stale replicas. Each replica returns its current checkpoint, which the primary uses to update its tracking state.
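
To make the proposed flow concrete, here is a minimal, language-neutral simulation in Python. It is only a sketch under stated assumptions, not OpenSearch code: `Primary`, `Replica`, `refresh`, `stale_replicas`, and `background_sync` are hypothetical names, checkpoints are modeled as plain integers, and a failed transport call is modeled by leaving a replica out of the `reachable` set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Replica:
    """Hypothetical replica; its checkpoint is modeled as an integer."""
    checkpoint: int = 0

    def sync_to(self, primary_checkpoint: int) -> int:
        # Simulate copying segments up to the primary's checkpoint,
        # then report the replica's current checkpoint back.
        self.checkpoint = max(self.checkpoint, primary_checkpoint)
        return self.checkpoint


@dataclass
class Primary:
    """Hypothetical primary that tracks the last checkpoint each replica reported."""
    checkpoint: int = 0
    replicas: Dict[str, Replica] = field(default_factory=dict)
    tracked: Dict[str, int] = field(default_factory=dict)

    def add_replica(self, name: str, replica: Replica) -> None:
        self.replicas[name] = replica
        self.tracked[name] = replica.checkpoint

    def refresh(self, reachable: Set[str]) -> None:
        # Push model: publish a new checkpoint. Replicas missing from
        # `reachable` stand in for a transport call that failed outright.
        self.checkpoint += 1
        for name in reachable:
            self.tracked[name] = self.replicas[name].sync_to(self.checkpoint)

    def stale_replicas(self) -> List[str]:
        # Replicas whose last reported checkpoint is behind the primary's.
        return sorted(n for n, cp in self.tracked.items() if cp < self.checkpoint)

    def background_sync(self) -> List[str]:
        # Proposed safety net: resend the latest checkpoint to known stale
        # replicas and update tracking state from what they return.
        synced = self.stale_replicas()
        for name in synced:
            self.tracked[name] = self.replicas[name].sync_to(self.checkpoint)
        return synced


if __name__ == "__main__":
    primary = Primary()
    primary.add_replica("r1", Replica())
    primary.add_replica("r2", Replica())
    primary.refresh(reachable={"r1"})   # the notification to r2 is lost
    print(primary.stale_replicas())     # r2 is behind with no further writes
    primary.background_sync()
    print(primary.stale_replicas())     # no stale replicas remain
```

In this model, `r2` would stay stale until the next write and refresh; the periodic `background_sync` closes that gap without requiring new writes, and the returned checkpoints keep the primary's tracking state (and therefore backpressure decisions) accurate.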

Describe alternatives you've considered
Switch to a pull model from replicas (#4577).

mch2 added the enhancement and untriaged labels on Oct 16, 2023
anasalkouz added the Indexing:Replication label and removed the untriaged label on Nov 9, 2023