CI Failure (wait_until segment_number_matches - compaction stuck) in CompactionE2EIdempotencyTest.test_basic_compaction
#8492
It looks like compaction isn't happening for two out of three partitions
|
if we grep |
I've been tinkering with this test in #8687. What I see in the last rebase is that the tests do not fail anymore at
The runs that are still failing do so in two spots: while checking that enough segments are produced before reducing the compaction interval
and at remote_wait_consumer. This one is more concerning.
|
@andijcr there is already an issue for that: #8698. There is some difference between the position and the last offset obtained via jumping to the end, but all written data was consumed. I've added a workaround there (made the test less strict) - 3e4a479 - but the current issue (wait_until segment_number_matches) still fails! See the failing build on the same PR #8624 - https://buildkite.com/redpanda/redpanda/builds/22887#01863754-84ea-4212-aec6-2c3ccf6138c8 |
I don't see the link between initialization of the txn fields and the segment problem, can you explain it? |
You ran 20 iterations; on my PR the failure rate was 2/50, so we don't have a clear signal that it's truly fixed |
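To put a rough number on that: a minimal sketch, assuming runs are independent and that the 2/50 rate observed on the PR is representative (both are assumptions, not something measured here). The helper name `clean_streak_probability` is made up for illustration. It computes the chance that n runs in a row pass even though nothing was fixed.

```bash
#!/bin/bash
# Probability that n independent runs all pass when each run fails with rate p.
# Assumption: runs are independent and p = 2/50 = 0.04 is representative.
clean_streak_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (1 - p) ^ n }'
}

clean_streak_probability 0.04 20   # prints 0.44
clean_streak_probability 0.04 50   # prints 0.13
```

So under these assumptions, 20 green iterations would show up a bit under half the time with the bug still present, which is why they are weak evidence on their own.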
OK, after 50 retries there was still a failure: #8783 (comment) |
The reasoning was to expand on Bharath's fix by initializing fields of type duration, since it's UB to access them without initialization. I wasn't able to find a code path, so I took a cautious approach and initialized whatever seemed suspicious. At least from ducktape, it seems the frequency of the failure decreased. I tried to run msan and valgrind, but even though I was able to compile with some modifications, the first crashed immediately on nettle and the second seems to raise false positives.
As of now I'm analyzing the latest failed ducktape run, with 3 nodes and 3 partitions. Counting the lines with "Creating new segment" and "Removing <segment_path>", we see that partitions 1 and 2 have a proportional number of creations/removals on the three nodes, but partition 0 does not get removal lines on node docker-rp-9. This makes the test fail, but one weirdness is that
These files are the lines of partition 0 for docker-rp-9: rp-9-partition-0.txt |
While checking this line from test_log.debug
I noticed that the checks are run only on the first node. I opened a PR to extend the check to the other 2 nodes: #8927. Slightly unrelated to the main bug: |
It is present in other runs in https://buildkite.com/redpanda/redpanda/builds/23027#01863dee-e467-4fa2-a905-ccb824ac66bf and others
summarizer:
#!/bin/bash
# Progress spinner: cycles through the characters in $sp.
sp="/-\|"
sc=0
spin() {
    printf '\b%s' "${sp:sc++:1}"
    ((sc==${#sp})) && sc=0
}
endspin() {
    printf "\r%s\n" "$@"
}
# For one node directory, count segment creations and removals for
# partitions 0..2 and drop a marker file: res_not if any partition has
# 5 or more segments created than removed, res_ok otherwise.
# Runs in a subshell so that cd and set -e do not leak to the caller.
check_segments_at () (
    set -e
    cd -- "$1"
    # grep -c exits non-zero on zero matches, hence the "|| true".
    grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/0" redpanda.log > segments_created || true
    grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/1" redpanda.log >> segments_created || true
    grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/2" redpanda.log >> segments_created || true
    grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/0" redpanda.log > segments_removed || true
    grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/1" redpanda.log >> segments_removed || true
    grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/2" redpanda.log >> segments_removed || true
    # One row per partition: column 1 = created, column 2 = removed.
    paste segments_created segments_removed | awk '{if(($1 - $2) < 5) printf "" > "res_ok"; else printf "" > "res_not"; }'
)
# nullglob: an unmatched glob expands to nothing instead of itself.
shopt -s nullglob
nodes=(ducktape*/results/latest/CompactionE2EIdempotencyTest/test_basic_compaction/*/*/RedpandaService-0*/docker-rp-*)
for path in "${nodes[@]}"; do
    spin
    check_segments_at "$path"
done
endspin
# List every node directory where compaction looks stuck.
find . -name "res_not"
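The summarizer only leaves marker files, so it does not say which partition is stuck. A small sketch that prints the per-partition counts for a single redpanda.log may help when looking at one node; it assumes the same two log line shapes grepped above, and the function name `segment_counts` is made up for illustration.

```bash
#!/bin/bash
# Print "partition<TAB>created<TAB>removed" for partitions 0..2 of one log file.
# Assumption: log lines look like the ones the summarizer greps for.
segment_counts() {
    local log="$1" partition created removed
    for partition in 0 1 2; do
        # grep -c prints 0 but exits non-zero when nothing matches.
        created=$(grep -c "Creating new segment /var/lib/redpanda/data/kafka/topic-[^/]*/${partition}" "$log" || true)
        removed=$(grep -c "Removing \"/var/lib/redpanda/data/kafka/topic-[^/]*/${partition}" "$log" || true)
        printf '%s\t%s\t%s\n' "$partition" "$created" "$removed"
    done
}

# Usage (path is an example): segment_counts docker-rp-9/redpanda.log
```

A partition whose removed count stays at 0 while created keeps growing would match what is seen for partition 0 on docker-rp-9.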
|
It's a low impact redpanda bug. Under certain circumstances the update to |
#8927 now contains the fix, I'll run CI to gain confidence |
This is in a new test in #8413
https://buildkite.com/redpanda/redpanda/builds/22034#0185fbf7-26e3-4971-b008-bc94644644c1