config: add "legacy defaults" capability and decrease partitions per shard default #9197

jcsp · 2023-03-01T15:54:32Z

Historically, it has been hard to change certain config defaults, because it can be unsafe to, for example, decrease a limit when systems in the wild already exceed the new limit.

This PR adds a new mechanism: early in startup, the feature table notifies the configuration object of the cluster's original logical version, and configuration properties may override their _default based on that version.
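To illustrate the idea, here is a minimal Python model of the behaviour described above. It is a sketch, not the C++ implementation in src/v/config: the class and method names and the cutoff version 9 are hypothetical, and only the property name and the 1000 and 7000 defaults come from this PR. The `<=` comparison follows the comment on `_legacy_default` in property.h.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LegacyDefault:
    """Alternative default for clusters whose original version is <= the cutoff."""
    value: int
    max_original_version: int  # hypothetical cutoff, not the real logical version


class Property:
    def __init__(self, name: str, default: int,
                 legacy_default: Optional[LegacyDefault] = None):
        self.name = name
        self._default = default
        self._legacy_default = legacy_default
        self._explicit: Optional[int] = None  # value set by an operator, if any

    def notify_original_version(self, original_version: int) -> None:
        """Called once the cluster's original logical version is known."""
        ld = self._legacy_default
        if ld is not None and original_version <= ld.max_original_version:
            self._default = ld.value

    def set(self, value: int) -> None:
        self._explicit = value

    def value(self) -> int:
        # An explicitly configured value always wins over either default.
        return self._default if self._explicit is None else self._explicit


def make_partitions_per_shard() -> Property:
    # 1000 and 7000 are from the PR; the cutoff 9 is an assumption.
    return Property("topic_partitions_per_shard", 1000, LegacyDefault(7000, 9))
```

In this model, a property created by `make_partitions_per_shard()` reports 7000 after `notify_original_version(9)` and stays at 1000 after `notify_original_version(10)`. An explicitly set value is unaffected in both cases.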

That mechanism is then used to reduce the partitions per shard default to 1000. This change is important because the current limits are unsafe, both because of the raw partition count and because the default tiered storage reader limits are derived from the max partition count.

The main limitation of the mechanism is that it only helps with configs that are read after cluster bootstrap or, for a node joining a cluster, after the new node sees the initial controller log messages. This suits the config handled in this PR, but any future use of the mechanism must bear it in mind (there is a boldface comment by the definition of the legacy_version struct).
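The ordering hazard can be shown with a small Python sketch. It assumes a simplified property object with a hypothetical cutoff version 9; the names are illustrative and are not Redpanda APIs. A read made before the original version is known sees the new default, even on a legacy cluster.

```python
class PartitionsPerShard:
    NEW_DEFAULT = 1000     # from the PR
    LEGACY_DEFAULT = 7000  # from the PR
    LEGACY_CUTOFF = 9      # hypothetical logical version

    def __init__(self) -> None:
        self._default = self.NEW_DEFAULT

    def notify_original_version(self, original_version: int) -> None:
        if original_version <= self.LEGACY_CUTOFF:
            self._default = self.LEGACY_DEFAULT

    def __call__(self) -> int:
        return self._default


def startup_sequence(original_version: int):
    """Return (value read before notification, value read after)."""
    prop = PartitionsPerShard()
    early_read = prop()  # before the feature table has reported anything
    prop.notify_original_version(original_version)
    late_read = prop()   # after bootstrap / initial controller log replay
    return early_read, late_read
```

For a legacy cluster, `startup_sequence(5)` returns `(1000, 7000)` in this model: any subsystem that caches the early read keeps the wrong limit, which is why the mechanism only suits configs read after that point.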

Fixes #9179

Backports Required

Release Notes

Improvements

Newly created Redpanda clusters will apply a limit of 1000 partitions per shard (cluster configuration property topic_partitions_per_shard) by default. Legacy clusters will continue to use the legacy limit (7000), although it is recommended to modify configuration on legacy clusters to use a lower limit, to avoid risk of instability if excessive partitions are created.

andrwng · 2023-03-02T17:48:41Z

tests/rptest/tests/resource_limits_test.py

@@ -88,16 +88,18 @@ def test_cpu_limited(self):

        rpk = RpkTool(self.redpanda)

-        # Three nodes, each with 1 core, 7000 partition-replicas
-        # per core, so with replicas=3, 7000 partitions should be the limit
+        # Three nodes, each with 1 core, 1000 partition-replicas


Not necessarily in this test, but it'd be great to add an upgrade test that validates that this config doesn't change when upgrading from 23.1.1.

andrwng · 2023-03-02T17:49:48Z

src/v/config/property.h

+
+    // An alternative default that applies if the cluster's original logical
+    // version is <= the defined version
+    const std::optional<legacy_default<T>> _legacy_default;


Doesn't have to be now, but I can imagine this eventually evolving into a list of defaults as we continue to tune certain defaults.

jcsp · 2023-03-02

Yeah, should be quick to add as/when we need it.
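One possible shape for such a list of defaults, sketched in Python purely for illustration. Nothing like this exists in the PR: the function name, the tier layout, the cutoff versions, and the value 500 in the usage note are all hypothetical.

```python
from typing import List, Tuple


def resolve_default(current_default: int,
                    legacy_defaults: List[Tuple[int, int]],
                    original_version: int) -> int:
    """Pick the default for a cluster created at original_version.

    legacy_defaults holds (max_original_version, value) tiers. The tier
    with the lowest cutoff that still covers original_version wins, and
    clusters newer than every cutoff get current_default.
    """
    for cutoff, value in sorted(legacy_defaults, key=lambda tier: tier[0]):
        if original_version <= cutoff:
            return value
    return current_default
```

With tiers `[(9, 7000), (12, 1000)]` and a hypothetical future default of 500, a cluster created at version 8 resolves to 7000, one created at version 10 resolves to 1000, and one created at version 13 resolves to 500.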

jcsp added the kind/enhance and area/redpanda labels Mar 1, 2023

jcsp added 4 commits March 2, 2023 00:12

config: enable setting specialized legacy defaults

7147f4b

This is a mechanism for us to set improved default values for cluster configuration properties, without impacting already-deployed systems. Systems with an older original_version in the feature table will apply the legacy default.

features: notify configuration when setting original version

d24e23d

This enables the configuration object to apply any special legacy-only defaults

config: change default partitions per shard to 1000

3e3975b

...while leaving a legacy default to retain the old value for old clusters, in case anyone was running a special system with more partitions than this (e.g. on very fast CPU cores that can handle it).

tests: update resource_limits_test

04c076e

This relied on the legacy 7000 partitions per shard default.
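The arithmetic behind the test comment in resource_limits_test.py can be written out as a short Python sketch. It only models the reasoning in that comment (partition-replica slots divided by the replication factor). The function name is hypothetical, and the real allocator may apply further constraints.

```python
def max_partitions(nodes: int, cores_per_node: int,
                   partitions_per_shard: int, replication_factor: int) -> int:
    """Upper bound on partitions implied by a per-shard replica limit."""
    # Each shard (core) can host partitions_per_shard partition-replicas.
    total_replica_slots = nodes * cores_per_node * partitions_per_shard
    # Every partition consumes replication_factor of those slots.
    return total_replica_slots // replication_factor
```

For the test's three single-core nodes with replicas=3, `max_partitions(3, 1, 7000, 3)` gives 7000 under the legacy default and `max_partitions(3, 1, 1000, 3)` gives 1000 under the new one.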

jcsp force-pushed the config-legacy-defaults branch from 4fdc45f to 04c076e Compare March 2, 2023 00:15

jcsp marked this pull request as ready for review March 2, 2023 09:47

jcsp requested review from dotnwat and mmaslankaprv March 2, 2023 09:49

andrwng reviewed Mar 2, 2023

jcsp requested a review from andrwng March 2, 2023 21:40

andrwng approved these changes Mar 2, 2023

jcsp merged commit d3f23a2 into redpanda-data:dev Mar 2, 2023

jcsp deleted the config-legacy-defaults branch March 2, 2023 21:43

andrwng mentioned this pull request Mar 14, 2023

CI Failure (NodeCrash: OOM in continuous_batch_parser) in ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown #9375

Closed
