Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

config: add "legacy defaults" capability and decrease partitions per shard default #9197

Merged
merged 4 commits into from
Mar 2, 2023

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Mar 1, 2023

Historically it has been tough to change certain config defaults because it may be unsafe to e.g. decrease a limit if there are systems in the wild that are already beyond the new limit.

This PR adds a new mechanism where early in startup, the feature table notifies the configuration object of the original logical version, and the configuration properties may override their _default based on that version.

That mechanism is then used to reduce the partitions per shard default to 1000. This change is important because the current limits are unsafe, both for raw partition count issues, and also for the default limits on tiered storage reader limits, which are driven by the max partition count.

The main limitation to the mechanism used here is that it will only help with configs that are read after cluster bootstrap happens, or in the case of a node joining a cluster, after the new node sees initial controller log messages. This is suitable for the particular config being handled in this PR, but it is important that any future uses of the mechanism bear this in mind (there is a boldface comment by the definition of legacy_version struct)

Fixes #9179

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Improvements

  • Newly created Redpanda clusters will apply a limit of 1000 partitions per shard (cluster configuration property topic_partitions_per_shard) by default. Legacy clusters will continue to use the legacy limit (7000), although it is recommended to modify configuration on legacy clusters to use a lower limit, to avoid risk of instability if excessive partitions are created.

@jcsp jcsp added kind/enhance New feature or request area/redpanda labels Mar 1, 2023
jcsp added 4 commits March 2, 2023 00:12
This is a mechanism for us to set improved default
values for cluster configuration properties, without
impacting already-deployed systems.  Systems with
an older original_version in the feature table will
apply the legacy default.
This enables the configuration object to apply any
special legacy-only defaults
...while leaving a legacy default to retain the old
value for old clusters, in case anyone was running
a special system with more partitions than this (e.g.
on very fast CPU cores that can handle it).
This relied on the legacy 7000 partitions
per shard default.
@jcsp jcsp marked this pull request as ready for review March 2, 2023 09:47
@@ -88,16 +88,18 @@ def test_cpu_limited(self):

rpk = RpkTool(self.redpanda)

# Three nodes, each with 1 core, 7000 partition-replicas
# per core, so with replicas=3, 7000 partitions should be the limit
# Three nodes, each with 1 core, 1000 partition-replicas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily this test, but it'd be great to add an upgrade test that validates that this config doesn't change over jumping from 23.1.1.


// An alternative default that applies if the cluster's original logical
// version is <= the defined version
const std::optional<legacy_default<T>> _legacy_default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to be now, but I can imagine this eventually evolving into a list of defaults as we continue to tune certain defaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should be quick to add as/when we need it.

@jcsp jcsp requested a review from andrwng March 2, 2023 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda kind/enhance New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI Failure (NodeCrash due to bad_alloc) in ShadowIndexingCacheSpaceLeakTest.test_si_cache
2 participants