
Reduce flakiness of certain Common Test suites #10364

Merged (14 commits) Jan 22, 2024
Conversation

@kjnilsson (Contributor) commented Jan 18, 2024

A variety of flaky-test improvements and smaller bug fixes.

Bug fixes:

  • Check that the rabbit app is running on the current node before modifying the stream coordinator cluster. This prevents unwanted changes from being triggered during shutdown.
  • rabbit_nodes:list_running/0 returns [] whenever there is a failure, so this PR also handles a few cases where that could cause unwanted side effects. This will be changed in a future version of the API.
  • The wait command now keeps retrying if it reads an empty binary from the file (a race condition).
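The first two bullets share one idea: an empty membership result may mean "the lookup failed", not "there are no members", so it is safer to skip a cluster change than to act on it. A minimal sketch of that guard, in Python for illustration only (RabbitMQ itself is Erlang, and the function names here are hypothetical):

```python
def maybe_update_cluster(list_running, apply_change):
    """Apply a cluster change only when membership info looks trustworthy.

    list_running: callable returning the list of running nodes; by analogy
                  with rabbit_nodes:list_running/0 it returns [] on failure.
    apply_change: callable that performs the cluster change.
    """
    members = list_running()
    if not members:
        # [] can mean "lookup failed" rather than "no members", so doing
        # nothing is safer than shrinking the cluster during shutdown.
        return "skipped"
    apply_change(members)
    return "applied"
```

The design choice is deliberately conservative: a missed update can be retried later, while an incorrect membership change (e.g. during shutdown) is hard to undo.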

It could help, we'll see.
During shutdown it is possible that the stream coordinator outlives the khepri metadata store, which causes rabbit_nodes:list_members/0 to return the empty list, which in turn could cause the stream coordinator to make incorrect cluster changes.

This commit handles those two cases.
The per_message_ttl test would publish a message with a short TTL and then assert on info counters. On a slow system it is possible that the message expires before the test can observe the counter change.
@michaelklishin michaelklishin changed the title Test reliability Reduce flakiness of certain Common Test suites Jan 19, 2024
As writing to a file isn't atomic between opening and writing, this can happen and would unnecessarily return the :garbage_in_pid_file error.
Stream deletes aren't necessarily fully complete by the time the queue.delete command returns, as the stream coordinator does this work asynchronously. By using unique queue names we avoid the need for additional polling / waiting for the delete operation to be fully completed.
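Generating a fresh name per test invocation means a queue left behind by an in-flight async delete can never collide with the next test's queue. A minimal sketch of such a helper, in Python for illustration (the helper name is hypothetical, not from the PR):

```python
import uuid


def unique_queue_name(testcase):
    """Build a per-invocation queue name for a Common Test style case.

    Suffixing with a random UUID means reruns never reuse a name, so a
    previous queue still being deleted asynchronously cannot collide.
    """
    return f"{testcase}-{uuid.uuid4().hex}"
```

This trades a little readability in broker logs for tests that never have to poll for a prior delete to finish.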
The leader_locator_balanced_random_maintenance test effectively uses a plain random approach, so we cannot assert that there would definitely be leaders on both potential nodes, only that there aren't any leaders on the node that is in maintenance mode.
@kjnilsson kjnilsson marked this pull request as ready for review January 22, 2024 20:05
@michaelklishin (Member)

Awesome! Just out of curiosity, let's see if this backports to v3.12.x ;)

@michaelklishin michaelklishin merged commit c1d37e3 into main Jan 22, 2024
19 checks passed
@michaelklishin michaelklishin deleted the flaky-mc-flake-flake branch January 22, 2024 21:22
@michaelklishin michaelklishin added this to the 3.13.0 milestone Jan 22, 2024