Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node_status_backend: reset backoff on peer checkin #11342

Merged
merged 4 commits into from
Jun 14, 2023

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Jun 12, 2023

When a peer restarts and a backoff is applied locally, it needs to be
reset once the peer is available again. Otherwise the transport does not
reconnect until the entire backoff elapses thus marking it unavailable
for downstream consumers like partition balancer.

Fixes #5795
Fixes #11307
Fixes #11276

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

@bharathv
Copy link
Contributor Author

/ci-repeat 4
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py
tests/rack_aware_replica_placement_test.py
tests/rptest/tests/leadership_transfer_test.py

VladLazar
VladLazar previously approved these changes Jun 12, 2023
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me.

@mmaslankaprv
Copy link
Member

This change looks good, i am wondering if we shouldn't also introduce a configuration that would allow to setup the max backoff time for node_status connections

@bharathv bharathv self-assigned this Jun 12, 2023
@bharathv
Copy link
Contributor Author

i am wondering if we shouldn't also introduce a configuration that would allow to setup the max backoff time for node_status connections

Done.. network partition is an interesting situation..

When a peer restarts and a backoff is applied locally, it needs to be
reset once the peer is available again. Otherwise the transport does not
reconnect until the entire backoff elapses thus marking it unavailable
for downstream consumers like partition balancer.
Adds node_status_reconnect_max_backoff_ms cluster configuration
and defaults to 15s.
@bharathv
Copy link
Contributor Author

Last force pushed is to fix conflicts.

@mmaslankaprv
Copy link
Member

There seems to be a related failure:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/node_status_test.py", line 141, in test_all_nodes_up
    status_graph.check_cluster_status()
  File "/root/tests/rptest/tests/node_status_test.py", line 98, in check_cluster_status
    assert is_live(
AssertionError: Expected node docker-rp-7 to be alive, but since_last_status > max_delta: ms_since_last_status=211, tolerance=125

@bharathv
Copy link
Contributor Author

umm looks like debug build strikes again.. taking a look.

Wrap it in a waiter. With debug builds the backend can potentially
take some additional time to reach the desired state, especially
after invalidating all the transports after resetting a backoff.
@vshtokman vshtokman merged commit 0ad8dd3 into redpanda-data:dev Jun 14, 2023
@bharathv bharathv deleted the rack_aw branch June 14, 2023 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants