
Reduce flakiness of certain Common Test suites #10364

Merged (14 commits) Jan 22, 2024
Conversation

@kjnilsson (Contributor) commented Jan 18, 2024

A variety of flaky-test improvements and smaller bug fixes.

Bug fixes:

  • Check that the rabbit app is running on the current node before modifying the stream coordinator cluster. This prevents unwanted changes from being triggered during shutdown.
  • rabbit_nodes:list_running/0 returns [] whenever there is a failure, so this PR also handles a few cases where that could cause unwanted side effects. This will be changed in a future version of the API.
  • The wait command now keeps retrying if it reads an empty binary from the file (a race condition).
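The first two bullets share one idea: an empty membership result may mean "the lookup failed", not "there are no members", so it is safer to skip a cluster change than to act on it. A minimal sketch of that guard, in Python for illustration only (RabbitMQ itself is Erlang, and the function names here are hypothetical):

```python
def maybe_update_cluster(list_running, apply_change):
    """Apply a cluster change only when membership info looks trustworthy.

    list_running: callable returning the list of running nodes; by analogy
                  with rabbit_nodes:list_running/0 it returns [] on failure.
    apply_change: callable that performs the cluster change.
    """
    members = list_running()
    if not members:
        # [] can mean "lookup failed" rather than "no members", so doing
        # nothing is safer than shrinking the cluster during shutdown.
        return "skipped"
    apply_change(members)
    return "applied"
```

The design choice is deliberately conservative: a missed update can be retried later, while an incorrect membership change (e.g. during shutdown) is hard to undo.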

It could help, we'll see.
During shutdown it is possible that the stream coordinator outlives the khepri metadata store, which causes rabbit_nodes:list_members/0 to return the empty list, which in turn could cause the stream coordinator to make incorrect cluster changes.

This commit handles those two cases.
The per_message_ttl test would publish a message with a short TTL and then assert on info counters. On a slow system it is possible that the message expires before the test can observe the counter change.
@michaelklishin michaelklishin changed the title Test reliability Reduce flakiness of certain Common Test suites Jan 19, 2024
As writing to a file isn't atomic between opening and writing, this can happen and would unnecessarily return the :garbage_in_pid_file error.
Stream deletes aren't necessarily fully complete by the time the queue.delete command returns, as the stream coordinator does this work asynchronously. By using unique queue names we avoid the need for additional polling / waiting for the delete operation to be fully completed.
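Generating a fresh name per test invocation means a queue left behind by an in-flight async delete can never collide with the next test's queue. A minimal sketch of such a helper, in Python for illustration (the helper name is hypothetical, not from the PR):

```python
import uuid


def unique_queue_name(testcase):
    """Build a per-invocation queue name for a Common Test style case.

    Suffixing with a random UUID means reruns never reuse a name, so a
    previous queue still being deleted asynchronously cannot collide.
    """
    return f"{testcase}-{uuid.uuid4().hex}"
```

This trades a little readability in broker logs for tests that never have to poll for a prior delete to finish.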
The leader_locator_balanced_random_maintenance test effectively uses a plain random approach, so we cannot assert that there would definitely be leaders on both potential nodes, only that there aren't any leaders on the node that is in maintenance mode.
@kjnilsson kjnilsson marked this pull request as ready for review January 22, 2024 20:05
@michaelklishin (Member)

Awesome! Just out of curiosity, let's see if this backports to v3.12.x ;)

@michaelklishin michaelklishin merged commit c1d37e3 into main Jan 22, 2024
19 checks passed
@michaelklishin michaelklishin deleted the flaky-mc-flake-flake branch January 22, 2024 21:22
@michaelklishin michaelklishin added this to the 3.13.0 milestone Jan 22, 2024