
cloud_storage: make cluster default read/write properties "sticky" #6950

Merged
merged 15 commits into redpanda-data:dev from issue-6917-sticky-properties on Nov 2, 2022

Conversation

@jcsp (Contributor) commented Oct 26, 2022

Cover letter

Instead of treating remote.read and remote.write as properties that can be overridden by topics, or left null to inherit cluster defaults (including picking up subsequent changes to those defaults), we now always store an explicit value for these properties on the topic. The cluster defaults simply control what that explicit value is set to at topic creation time (or at upgrade time, for clusters coming from <= 22.2).
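The "sticky" semantics described above can be sketched in Python (all names here are illustrative, not the actual Redpanda implementation): cluster defaults are snapshotted onto the topic once, at creation, so later changes to the defaults leave existing topics untouched.

```python
# Hypothetical model of "sticky" defaults -- not the real Redpanda code.
def create_topic_config(cluster_defaults, overrides=None):
    """Snapshot remote.read/remote.write onto the topic at creation time."""
    overrides = overrides or {}
    return {
        "redpanda.remote.read": overrides.get(
            "redpanda.remote.read",
            cluster_defaults["cloud_storage_enable_remote_read"]),
        "redpanda.remote.write": overrides.get(
            "redpanda.remote.write",
            cluster_defaults["cloud_storage_enable_remote_write"]),
    }

defaults = {"cloud_storage_enable_remote_read": True,
            "cloud_storage_enable_remote_write": True}
topic = create_topic_config(defaults)

# Flipping a cluster default later does not change the existing topic...
defaults["cloud_storage_enable_remote_read"] = False
assert topic["redpanda.remote.read"] is True
# ...but it does change the value new topics start with.
assert create_topic_config(defaults)["redpanda.remote.read"] is False
```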

This is based on #6876

Fixes #6917

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

See release notes

Release notes

Improvements

  • The properties cloud_storage_enable_remote_[read|write] are now applied to topics at creation time; if the cluster defaults subsequently change, existing topics' properties do not change. To modify the tiered storage mode of an existing topic, set the redpanda.remote.[read|write] properties on the topic.

@jcsp jcsp added kind/tech-debt area/cloud-storage Shadow indexing subsystem labels Oct 26, 2022
@jcsp jcsp force-pushed the issue-6917-sticky-properties branch 2 times, most recently from 73ae5af to 94647fc on October 27, 2022 13:22
@VladLazar (Contributor) left a comment

Looking good. I think we should update disk_log_impl::override_retention_config to remove the usage of cloud_storage_enable_remote_write and use whatever is in the ntp_config. Might be worth waiting for #6951 to merge, though.

@jcsp (Contributor, Author) commented Oct 28, 2022

I think we should update disk_log_impl::override_retention_config to remove the usage of cloud_storage_enable_remote_write and use whatever is in the ntp_config

Agreed, added a commit that does that and rebased on dev since #6951 merged

@jcsp jcsp force-pushed the issue-6917-sticky-properties branch from f7c3a39 to 9dd6401 on October 28, 2022 12:30
@jcsp (Contributor, Author) commented Oct 28, 2022

Oops, the last force push picked up some unrelated commits, re-pushed.

@jcsp jcsp marked this pull request as ready for review October 28, 2022 12:38
@jcsp jcsp force-pushed the issue-6917-sticky-properties branch from 9dd6401 to 5d72c7d on October 31, 2022 12:59
@jcsp (Contributor, Author) commented Oct 31, 2022

Force push: updated test_bytes_eviction_overrides for the way we no longer take topic SI enablement at the storage layer from the global config; this PR raced with the one that added it.

@VladLazar (Contributor) commented:

Code looks fine, but there are two tests failing:

  • ShadowIndexingRetentionTest.test_shadow_indexing_default_local_retention: the logic for computing whether a deletion is expected needs updating to reflect the new behaviour:
    expect_deletion = cluster_remote_write if topic_remote_write == "-1" else topic_remote_write == "true"
  • UpgradeFromPriorFeatureVersionCloudStorageTest.test_rolling_upgrade also fails, but I'm not sure why
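The suggested expectation can be captured in a small Python sketch of the formula above (as in the test, the sentinel "-1" stands for "no explicit topic override"; the function name is just illustrative):

```python
# With sticky properties, an unset ("-1") topic override falls back to the
# value captured from the cluster default; otherwise the explicit topic
# value wins.
def expect_deletion(cluster_remote_write: bool, topic_remote_write: str) -> bool:
    if topic_remote_write == "-1":
        return cluster_remote_write
    return topic_remote_write == "true"

assert expect_deletion(True, "-1") is True      # inherits cluster default
assert expect_deletion(False, "-1") is False
assert expect_deletion(False, "true") is True   # explicit override wins
assert expect_deletion(True, "false") is False
```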

Vlad Lazar and others added 12 commits November 1, 2022 17:47

  • This commit changes the code that determines what shadow indexing mode to use when a topic is created. Previously, if no remote write or read configs were specified via topic configs, the mode was left uninitialised and other subsystems decided its value independently based on cluster level configs. Now we always initialise it, using the cluster configs as defaults.
  • This commit removes all usages of the remote read/write cluster configs in the archival subsystem. Instead, we now always use whatever configuration is stored in the topic table as the source of truth. Essentially, the cluster level configs act as defaults and the topic level configs are overrides.
  • This commit updates the description of the remote read and write cluster level configs to reflect their new behaviour: defaults at topic creation.
  • We no longer need to check the global defaults at this point: during the upgrade migration, the SI mode will have been stored in the topic.
  • These are now simply read during topic creation (or upgrade), so runtime changes are handled implicitly.
  • Previously this could generate spurious "Nodes report restart required" messages if the cluster was newly restarted and some nodes were slow to get config statuses, because the final check (for restart status) uses a different call into the admin API than the earlier check for config version. Fix + simplify by using a wait_until_result to keep the last polled config status for the final check on restart status.
  • This validates behavior in 22.3 where the current state of cluster-wide tiered storage settings is applied persistently to each topic that exists at the time of upgrade.
  • Test ShadowIndexingGlobalConfig.test_overrides_set was setting remote.read to false twice, instead of setting remote.read to false and remote.write to false. Apparently this used to somehow result in a subsequent describe seeing false for both properties, but that seems like a bug.
  • ...for sticky tiered storage configuration: the defaults at cluster level now only apply at topic creation time, and remain set on that topic thereafter. Related: redpanda-data#6917
  • …on_active: we now explicitly set cloud storage settings in the topic config; the cluster defaults don't matter.
  • It is no longer the case that a topic with remote.write=false will have shadow indexing enabled when the cluster property is true.
  • This is a subtle upgrade case. When a 22.3 node starts and has not yet executed the migration to move tiered storage enablement flags from global defaults onto the topic, the archival_metadata_stm::max_collectible_offset method will naively return offset::max (i.e. giving storage-gc permission to proceed without waiting for uploads). Fix this by reinstating the legacy behavior (of assuming tiered storage is enabled if the global default is true) conditionally in the case where we are in the middle of an upgrade (i.e. when the 22.3 feature flag is not yet active).
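The upgrade fallback in that last commit message can be sketched as follows (illustrative Python, not the actual C++; the function and parameter names are hypothetical):

```python
# Sketch: how enabled-ness of remote write could be interpreted mid-upgrade.
# Before the 22.3 feature flag is active, an unmigrated topic (no explicit
# stored value) falls back to the legacy reading of the global default, so
# storage-gc keeps waiting for uploads instead of truncating ahead of them.
def remote_write_enabled(topic_cfg, global_default, feature_active):
    if topic_cfg is not None:   # migration has stored an explicit value
        return topic_cfg
    if not feature_active:      # mid-upgrade: apply old-style rules
        return global_default
    return False                # post-upgrade, unset means disabled

# During upgrade, an unmigrated topic still honours the global default...
assert remote_write_enabled(None, True, feature_active=False) is True
# ...after upgrade, only the explicit per-topic value matters.
assert remote_write_enabled(None, True, feature_active=True) is False
assert remote_write_enabled(False, True, feature_active=True) is False
```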
@jcsp jcsp force-pushed the issue-6917-sticky-properties branch from 5d72c7d to 2d8348d on November 1, 2022 17:47
@jcsp (Contributor, Author) commented Nov 1, 2022

Pushed:

  • Updated retention_policy_test for the way that enabling the cluster default property no longer forces SI on for every topic
  • Updated archival_metadata_stm logic for max_collectible_offset: this was behind the test_rolling_upgrade failure. It was a legit bug: on a partly upgraded system we were prefix-truncating the local log ahead of the uploaded location (more in the commit message)

Similar to the previous commit in the archival state machine: when
the cluster is in an upgrading state, we need to apply old-style rules
about the SI flags, to avoid passing through a state where we
temporarily tell clients that there is no data earlier than the
raft contents (this manifests as a client doing list_offsets
during the upgrade and seeing their start_offset jump
forward to the raft log's start offset).
@jcsp (Contributor, Author) commented Nov 1, 2022

Push:

  • One more commit to do the same upgrade handling trick in cluster::partition that I just did in the archival state machine. Basically these are the places where our interpretation of enabled-ness changed, and we need to interpret ntp configs in the old style until upgrades are complete.

The failure addressed by this last commit was in test_rolling_upgrade but was not deterministic; it reproduced when running lots of test runs in parallel on my workstation.

@dotnwat (Member) left a comment

code lgtm

@VladLazar (Contributor) left a comment

Looks good. Nice catch on applying the old behaviour during upgrades.

@jcsp jcsp merged commit 578d034 into redpanda-data:dev Nov 2, 2022
@jcsp jcsp deleted the issue-6917-sticky-properties branch November 2, 2022 11:29
3 participants