
cloud_storage: make cluster default read/write properties "sticky" #6950

Merged
merged 15 commits into redpanda-data:dev from issue-6917-sticky-properties on Nov 2, 2022

Conversation

@jcsp (Contributor) commented Oct 26, 2022

Cover letter

Instead of treating remote.read and remote.write as properties that can be overridden by topics, or left null to inherit cluster defaults (including picking up subsequent changes to those defaults), we now always store an explicit value for these properties on the topic. The cluster defaults simply control what that explicit value is set to at topic creation time (or at upgrade time, for clusters coming from <= 22.2).
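The "sticky" semantics described above can be sketched in Python (all names here are illustrative, not the actual Redpanda implementation): cluster defaults are snapshotted onto the topic once, at creation, so later changes to the defaults leave existing topics untouched.

```python
# Hypothetical model of "sticky" defaults -- not the real Redpanda code.
def create_topic_config(cluster_defaults, overrides=None):
    """Snapshot remote.read/remote.write onto the topic at creation time."""
    overrides = overrides or {}
    return {
        "redpanda.remote.read": overrides.get(
            "redpanda.remote.read",
            cluster_defaults["cloud_storage_enable_remote_read"]),
        "redpanda.remote.write": overrides.get(
            "redpanda.remote.write",
            cluster_defaults["cloud_storage_enable_remote_write"]),
    }

defaults = {"cloud_storage_enable_remote_read": True,
            "cloud_storage_enable_remote_write": True}
topic = create_topic_config(defaults)

# Flipping a cluster default later does not change the existing topic...
defaults["cloud_storage_enable_remote_read"] = False
assert topic["redpanda.remote.read"] is True
# ...but it does change the value new topics start with.
assert create_topic_config(defaults)["redpanda.remote.read"] is False
```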

This is based on #6876

Fixes #6917

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

See release notes

Release notes

Improvements

  • The properties cloud_storage_enable_remote_[read|write] are now applied to topics at creation time; if the cluster defaults subsequently change, existing topics' properties do not change. To modify the tiered storage mode of an existing topic, set the redpanda.remote.[read|write] properties on the topic.

@jcsp jcsp added kind/tech-debt area/cloud-storage Shadow indexing subsystem labels Oct 26, 2022
@jcsp jcsp force-pushed the issue-6917-sticky-properties branch 2 times, most recently from 73ae5af to 94647fc on October 27, 2022 13:22
@VladLazar (Contributor) left a comment

Looking good. I think we should update disk_log_impl::override_retention_config to remove the usage of cloud_storage_enable_remote_write and use whatever is in the ntp_config. Might be worth waiting for #6951 to merge, though.

@jcsp (Contributor, Author) commented Oct 28, 2022

I think we should update disk_log_impl::override_retention_config to remove the usage of cloud_storage_enable_remote_write and use whatever is in the ntp_config

Agreed, added a commit that does that and rebased on dev since #6951 merged

@jcsp jcsp force-pushed the issue-6917-sticky-properties branch from f7c3a39 to 9dd6401 on October 28, 2022 12:30
@jcsp (Contributor, Author) commented Oct 28, 2022

Oops, the last force push picked up some unrelated commits, re-pushed.

@jcsp jcsp marked this pull request as ready for review October 28, 2022 12:38
@jcsp jcsp force-pushed the issue-6917-sticky-properties branch from 9dd6401 to 5d72c7d on October 31, 2022 12:59
@jcsp (Contributor, Author) commented Oct 31, 2022

Force push: updated test_bytes_eviction_overrides for the way we no longer take topic SI enablement at the storage layer from the global config; this PR raced with the one that added it.

@VladLazar (Contributor) commented:

Code looks fine, but there are two tests failing:

  • ShadowIndexingRetentionTest.test_shadow_indexing_default_local_retention: the logic for computing whether a deletion is expected needs updating to reflect the new behaviour:
    expect_deletion = cluster_remote_write if topic_remote_write == "-1" else topic_remote_write == "true"
  • UpgradeFromPriorFeatureVersionCloudStorageTest.test_rolling_upgrade also fails, but I'm not sure why
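The suggested expectation can be captured in a small Python sketch of the formula above (as in the test, the sentinel "-1" stands for "no explicit topic override"; the function name is just illustrative):

```python
# With sticky properties, an unset ("-1") topic override falls back to the
# value captured from the cluster default; otherwise the explicit topic
# value wins.
def expect_deletion(cluster_remote_write: bool, topic_remote_write: str) -> bool:
    if topic_remote_write == "-1":
        return cluster_remote_write
    return topic_remote_write == "true"

assert expect_deletion(True, "-1") is True      # inherits cluster default
assert expect_deletion(False, "-1") is False
assert expect_deletion(False, "true") is True   # explicit override wins
assert expect_deletion(True, "false") is False
```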

Vlad Lazar and others added 12 commits November 1, 2022 17:47

  • This commit changes the code that determines what shadow indexing mode to use when a topic is created. Previously, if no remote write or read configs were specified via topic configs, the mode was left uninitialised and other subsystems decided its value independently based on cluster level configs. Now we always initialise it, using the cluster configs as defaults.
  • This commit removes all usages of the remote read/write cluster configs in the archival subsystem. Instead, we now always use whatever configuration is stored in the topic table as the source of truth. Essentially, the cluster level configs act as defaults and the topic level configs are overrides.
  • This commit updates the description of the remote read and write cluster level configs to reflect their new behaviour: defaults at topic creation.
  • We no longer need to check the global defaults at this point: during the upgrade migration, the SI mode will have been stored in the topic.
  • These are now simply read during topic creation (or upgrade), so runtime changes are handled implicitly.
  • Previously this could generate spurious "Nodes report restart required" messages if the cluster was newly restarted and some nodes were slow to get config statuses, because the final check (for restart status) uses a different call into the admin API than the earlier check for config version. Fix + simplify by using a wait_until_result to keep the last polled config status for the final check on restart status.
  • This validates behavior in 22.3 where the current state of cluster-wide tiered storage settings is applied persistently to each topic that exists at the time of upgrade.
  • Test ShadowIndexingGlobalConfig.test_overrides_set was setting remote.read to false twice, instead of setting remote.read to false and remote.write to false. Apparently this used to somehow result in a subsequent describe seeing false for both properties, but that seems like a bug.
  • ...for sticky tiered storage configuration: the defaults at cluster level now only apply at topic creation time, and remain set on that topic thereafter. Related: redpanda-data#6917
  • …on_active: we now explicitly set cloud storage settings in the topic config; the cluster defaults don't matter.
  • It is no longer the case that a topic with remote.write=false will have shadow indexing enabled when the cluster property is true.
  • This is a subtle upgrade case. When a 22.3 node starts and has not yet executed the migration to move tiered storage enablement flags from global defaults onto the topic, the archival_metadata_stm::max_collectible_offset method will naively return offset::max (i.e. giving storage-gc permission to proceed without waiting for uploads). Fix this by reinstating the legacy behavior (of assuming tiered storage is enabled if the global default is true) conditionally in the case where we are in the middle of an upgrade (i.e. when the 22.3 feature flag is not yet active).
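The upgrade fallback in that last commit message can be sketched as follows (illustrative Python, not the actual C++; the function and parameter names are hypothetical):

```python
# Sketch: how enabled-ness of remote write could be interpreted mid-upgrade.
# Before the 22.3 feature flag is active, an unmigrated topic (no explicit
# stored value) falls back to the legacy reading of the global default, so
# storage-gc keeps waiting for uploads instead of truncating ahead of them.
def remote_write_enabled(topic_cfg, global_default, feature_active):
    if topic_cfg is not None:   # migration has stored an explicit value
        return topic_cfg
    if not feature_active:      # mid-upgrade: apply old-style rules
        return global_default
    return False                # post-upgrade, unset means disabled

# During upgrade, an unmigrated topic still honours the global default...
assert remote_write_enabled(None, True, feature_active=False) is True
# ...after upgrade, only the explicit per-topic value matters.
assert remote_write_enabled(None, True, feature_active=True) is False
assert remote_write_enabled(False, True, feature_active=True) is False
```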
@jcsp jcsp force-pushed the issue-6917-sticky-properties branch from 5d72c7d to 2d8348d on November 1, 2022 17:47
@jcsp (Contributor, Author) commented Nov 1, 2022

Pushed:

  • Updated retention_policy_test for the way that enabling the cluster default property no longer forces SI on for every topic
  • Updated archival_metadata_stm logic for max_collectible_offset: this was behind the test_rolling_upgrade failure. It was a legit bug: on a partly upgraded system we were prefix-truncating the local log ahead of the uploaded location (more in the commit message)

Similar to the previous commit in the archival state machine: when
the cluster is in an upgrading state, we need to apply old-style rules
about the SI flags, to avoid passing through a state where we
temporarily tell clients that there is no data earlier than the
raft contents (this manifests as a client doing list_offsets
during the upgrade and seeing their start_offset jump
forward to the raft log's start offset).
@jcsp (Contributor, Author) commented Nov 1, 2022

Push:

  • One more commit to do the same upgrade handling trick in cluster::partition that I just did in the archival state machine. Basically these are the places where our interpretation of enabled-ness changed, and we need to interpret ntp configs in the old style until upgrades are complete.

The failure addressed by this last commit was in test_rolling_upgrade but was not deterministic; it reproduced when running lots of test runs in parallel on my workstation.

@dotnwat (Member) left a comment

code lgtm

@VladLazar (Contributor) left a comment

Looks good. Nice catch on applying the old behaviour during upgrades.

@jcsp jcsp merged commit 578d034 into redpanda-data:dev Nov 2, 2022
@jcsp jcsp deleted the issue-6917-sticky-properties branch November 2, 2022 11:29
3 participants