Decommissioning can be very slow #9593
Follow-up notes:
- This should be addressed by adding node-wide recovery throttling.
- Partially addressed with: #9992
- Improvements in bandwidth saturation added via #10339
Version & Environment
Redpanda version: v23.1.1
What went wrong?
I am decommissioning what I consider to be a moderately sized node: ~9 TB of local data across ~240 partition replicas in a single topic.
The decommissioning is going very slowly, and the speed varies a lot, from "slow" to "glacial". It seems likely the decommission will take more than 12 hours. In this particular case, all the decommissioned partitions are being sent to the same node; that is expected, since it is a new, empty node brought in to replace the old one.
These boxes have ~1.6 GB/s of disk bandwidth (and network bandwidth is better than that), so at full speed the decommission could finish in about 1.5 hours. We don't need to achieve full speed, but the observed speed seems unnecessarily slow, and can turn some day 2 operations into day 2 and day 3 operations.
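For reference, that bound is just data size over bandwidth; a back-of-the-envelope sketch using only the figures quoted above:

```python
# Lower bound on decommission time: bytes to move / disk bandwidth.
data_bytes = 9e12   # ~9 TB of local data on the node
disk_bw = 1.6e9     # ~1.6 GB/s disk bandwidth (network is faster)

print(f"best case: {data_bytes / disk_bw / 3600:.1f} h")  # -> ~1.6 h
```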
What should have happened instead?
Faster decommissioning.
How to reproduce the issue?
Additional information
The destination machine is mostly unloaded during the process.
The decommissioning appears to proceed in batches: initially a large block of ~50 partitions all decommission at a similar rate, and the speed is slow but not terrible, on the order of 300 MB/s. However, as partitions begin to finish, new ones are not added, and there is then a long period with only a few partitions decommissioning at a very slow speed.
Visually:
For about 30 minutes at the left we decommission at about 280 MB/s, but then the rate drops, and with fewer partitions left we spend nearly 90 minutes at about 26.1 MB/s. Then there is a short period at only ~3 MB/s where I think one partition was left. Finally another batch of ~50 partitions starts and the rate goes back up to ~315 MB/s.
This pattern roughly repeats, though the exact shape differs each time. If we assume an average speed of 150 MB/s, this will take ~17 hours to decommission.
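That estimate is the same arithmetic at the assumed average rate:

```python
# ~9 TB at an assumed sustained average of 150 MB/s.
print(f"ETA: {9e12 / 150e6 / 3600:.1f} h")  # -> ~16.7 h, i.e. ~17 h
```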
The node is very lightly loaded: close to zero reactor usage, and nothing much going on besides the incoming partitions.
Bamboo link covering the first few hours of the decommission (some samples are missing due to a cortex issue):
https://dev.bamboo.monitoring.dev.vectorized.cloud/grafana/d/vTJl4WBVz/redpanda-clusters-v1-for-15s-samples?orgId=1&from=1679412199363&to=1679428976063&var-node_shard=All&var-aggr_criteria=pod&var-data_cluster=amp13-3&var-node=192.168.1.96%3A9644&viewPanel=23763572007
Part of the problem is that even the "max rate" seems slow (~300 MB/s), but another part is the long, very slow periods of 30 MB/s or less. Those appear to be caused by decommissioning 50 partitions at once and not starting the next 50 until the whole batch is done: since the partitions take different amounts of time (which happens even when they are the same size), there can be a long period with only a few partitions left in the batch, and there seems to be a per-partition speed limit of single-digit MB/s. A toy simulation of this is sketched below.
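To illustrate why that batching policy would produce this shape, here is a toy model. All of the numbers are assumptions extrapolated from the figures above, not measurements of Redpanda's actual scheduler: 50 equal-size partitions per batch, each capped at ~6 MB/s, ~10% of them running at half speed, and the next batch held until the current one fully drains.

```python
# Toy model of the suspected behavior (all figures are assumptions): a batch
# of 50 equal-size partitions moves concurrently, each capped at ~6 MB/s
# (so the batch starts near 50 * 6 = 300 MB/s aggregate), and the next batch
# does not start until every partition in the current batch has finished.
PARTITION_BYTES = 37.5e9   # ~9 TB / ~240 partitions
CAP = 6e6                  # assumed per-partition cap, ~6 MB/s

# Assume ~10% of the batch runs at half speed, as in the screenshot below.
rates = [CAP] * 45 + [CAP / 2] * 5
finish = sorted(PARTITION_BYTES / r for r in rates)

total = len(rates) * PARTITION_BYTES
print(f"fast partitions finish at {finish[0] / 3600:.2f} h")       # ~1.74 h
print(f"batch fully drains at {finish[-1] / 3600:.2f} h")          # ~3.47 h
print(f"batch-wide average: {total / finish[-1] / 1e6:.0f} MB/s")  # ~150 MB/s
```

Under these assumptions the batch spends its last ~1.7 hours moving only the half-speed stragglers, and the batch-wide average drops to ~150 MB/s even though the aggregate starts near ~300 MB/s; refilling the batch as partitions finish would avoid the tail.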
Here's an example of a couple of partitions going slowly:
Note that nearly all the partitions are at 37% complete, but a couple are at 18 or 19%, almost exactly half the speed.