[RFC] Using Segment Replication for Cross Cluster Replication #550

ankitkala · 2022-09-09T11:40:22Z

What are you proposing?

Current implementation of CCR uses logical replication where we replay all the leader shard’s operations on the follower's primary shard. With the ongoing effort for Segment Replication, local replica will simply sync the segments stored on the disk from primary shard offering significantly better throughput (documented here (opensearch-project/OpenSearch#2229)). This documents proposes the design for Cross Cluster Replication using the Segment Replication.

What problems are you trying to solve?

This change should enable the support for Segment Replication for CCR use cases.
Pros:

Lower CPU utilisation on follower cluster as follower doesn't need to execute the same operations again.
Pause and resume:
- Ability to pause & resume even after 12 hours: With logical, we only support resume till 12 hours(retention lease ttl). After 12 hours, since the translog operations on leader shard won't be available, user has to to delete all the data on follower and restart the replication. This issue will become obsolete with Segment replication.
- Easy pause/resume on ingestion heavy cluster: With logical replication, after resume, follower cluster can have a huge backlog of translog operations to be replayed and can struggle to cope up with leader. It can also make the cluster unstable as the operations are fetched on follower and executed. With segment replication, we can resume anytime and follower cluster will only request diff in segments.
Lower overhead on Leader cluster: With Segment replication and remote storage integration, CCR can work with almost zero overhead on leader cluster where follower directly communicates with follower for fetching the operations.

Cons:

Higher network utilisation due to Segment merges.
Might not be able to do Active-Active replication with Segment Replication. We can still keep supporting logical replication for this though.
Reliance on refresh interval can affect the overall performance.

What is the user experience going to be?

There should be no change in the user experience. Customer would still be using existing CCR APIs for configuring the replication. Depending on the replication model used between primary and replica on the leader cluster, CCR will rely on the same.

Why should it be built? Any reason not to?

Reasons to build:

All the "Pros" mentioned above.
Aligned with the OpenSearch's strategy for local replication.

Reasons not to build:

We already have logical replication for CCR and can also explore investing the efforts into improving that instead.

What will it take to execute?

[OpenSearch] Refactor the existing Segment Replication implementation to use pull model(reference).
[OpenSearch] Add support for Segment copy from remote clusters.
[Cross-cluster replication]Introduce support for Segment Replication in CCR plugin()

ankitkala added enhancement New feature or request roadmap labels Sep 9, 2022

ankitkala changed the title ~~[PLACEHOLDER][RFC] Using Segment Replication for Cross Cluster Replication~~ [RFC] Using Segment Replication for Cross Cluster Replication Sep 9, 2022

ankitkala closed this as completed Aug 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Using Segment Replication for Cross Cluster Replication #550

[RFC] Using Segment Replication for Cross Cluster Replication #550

ankitkala commented Sep 9, 2022 •

edited

Loading

[RFC] Using Segment Replication for Cross Cluster Replication #550

[RFC] Using Segment Replication for Cross Cluster Replication #550

Comments

ankitkala commented Sep 9, 2022 • edited Loading

What are you proposing?

What problems are you trying to solve?

What is the user experience going to be?

Why should it be built? Any reason not to?

What will it take to execute?

ankitkala commented Sep 9, 2022 •

edited

Loading