CI Failure (timeout waiting for transform workload) in DataTransformsLoggingTest.data_transforms_test #16961

Closed
andrwng opened this issue Mar 8, 2024 · 13 comments
Assignees: rockwotj
Labels: area/wasm (WASM Data Transforms), ci-failure, kind/bug (Something isn't working)

Comments

@andrwng
Contributor

andrwng commented Mar 8, 2024

Link is from the build of a v23.3.x backport, but I don't see this tracked as an open issue yet.

https://buildkite.com/redpanda/redpanda/builds/45814#018e1a53-991c-4013-a65a-c5330cbbe781

Module: rptest.tests.data_transforms_test
Class:  DataTransformsLoggingTest
Method: test_logs_volume
test_id:    rptest.tests.data_transforms_test.DataTransformsLoggingTest.test_logs_volume
status:     FAIL
run time:   54.604 seconds


    TimeoutError('Timed out for transform verifier to complete TransformVerifierService-1-139828557698416')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/data_transforms_test.py", line 520, in test_logs_volume
    consumer_status = self._consume_output_topic(topic=output_topic,
  File "/root/tests/rptest/tests/data_transforms_test.py", line 158, in _consume_output_topic
    result = TransformVerifierService.oneshot(
  File "/root/tests/rptest/services/transform_verifier_service.py", line 191, in oneshot
    service.wait(timeout_sec=timeout_sec)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 287, in wait
    if not self.wait_node(node, end - now):
  File "/root/tests/rptest/services/transform_verifier_service.py", line 247, in wait_node
    wait_until(
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Timed out for transform verifier to complete TransformVerifierService-1-139828557698416
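
For context, the `TimeoutError` at the bottom of the traceback is raised by a polling helper that gives up once a deadline passes. The following is a minimal standalone sketch of that pattern, not ducktape's actual implementation; `poll_until` and its parameter names are hypothetical and chosen only for illustration.

```python
import time


def poll_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Call condition() until it returns truthy or timeout_sec elapses.

    Hypothetical stand-in for a wait_until-style helper; not ducktape's code.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            # err_msg may be a string or a zero-argument callable.
            raise TimeoutError(err_msg() if callable(err_msg) else err_msg)
        time.sleep(backoff_sec)
```

In the failing test, the condition being polled is the verifier reporting completion, so a single partition that never catches up is enough to hit the deadline.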

JIRA Link: CORE-1863

@andrwng andrwng added kind/bug Something isn't working ci-failure area/wasm WASM Data Transforms labels Mar 8, 2024
@rockwotj rockwotj self-assigned this Mar 8, 2024
@rockwotj rockwotj added the ci-rca/test CI Root Cause Analysis - Test Issue label Mar 8, 2024
@rockwotj
Contributor

rockwotj commented Mar 8, 2024

Partition 2 never seemed to make any progress for the transform:

[DEBUG - 2024-03-07 19:51:06,104 - transform_verifier_service - _get_status_for_node - lineno:344]: Status endpoint TransformVerifierService-1-139828557698416 response: TransformVerifierConsumeStatus(latest_seqnos={'0': 114, '1': 114, '3': 114, '4': 114, '5': 114, '6': 114, '7': 114, '8': 114}, invalid_records=0, error_count=0)
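
A quick way to spot the stalled partition in a status response like the one above is to compare the reported keys against the expected partition ids. This is a rough sketch: `stalled_partitions` is a hypothetical helper, and the partition count of 9 (ids 0 through 8) and the final sequence number of 114 are assumptions inferred from the log line, not values confirmed by the test code.

```python
def stalled_partitions(latest_seqnos, partition_count, expected_seqno):
    """Return partition ids that are missing or behind expected_seqno."""
    stalled = []
    for p in range(partition_count):
        seqno = latest_seqnos.get(str(p))
        if seqno is None or seqno < expected_seqno:
            stalled.append(p)
    return stalled


# Values copied from the status response above.
latest_seqnos = {'0': 114, '1': 114, '3': 114, '4': 114,
                 '5': 114, '6': 114, '7': 114, '8': 114}

# Assumption: 9 partitions (ids 0..8) and a final seqno of 114.
print(stalled_partitions(latest_seqnos, partition_count=9, expected_seqno=114))
# prints [2]
```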

@rockwotj
Contributor

rockwotj commented Mar 8, 2024

TRACE 2024-03-07 19:50:55,709 [shard 1:tran] transform - identity-logging_xform/2 - transform_processor.cc:228 - consumed up to offset 114
TRACE 2024-03-07 19:50:55,714 [shard 1:tran] storage - readers_cache.cc:327 - {kafka/topic-heauhqtpba/2} - removing reader: [0,114] lower_bound: 115

Partition 2 did finish reading all the data from the input partition
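
To check this for every partition rather than by eye, the broker log can be scanned for the "consumed up to offset" lines. The sketch below assumes all such lines follow the shape of the single TRACE line quoted above; the regex and the `consumed_offsets` helper are illustrative and not part of rptest.

```python
import re

# Assumed line shape, based only on the TRACE line quoted above:
#   ... transform - <name>/<partition> - <file>:<line> - consumed up to offset <n>
CONSUMED = re.compile(
    r"transform - (?P<name>\S+)/(?P<partition>\d+) - \S+ - "
    r"consumed up to offset (?P<offset>\d+)"
)


def consumed_offsets(lines):
    """Map (transform name, partition) to the highest consumed offset seen."""
    latest = {}
    for line in lines:
        m = CONSUMED.search(line)
        if m:
            key = (m["name"], int(m["partition"]))
            latest[key] = max(latest.get(key, -1), int(m["offset"]))
    return latest
```

Comparing this map against the verifier's `latest_seqnos` shows whether a gap is on the read side of the transform or somewhere after it.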

@rockwotj rockwotj changed the title [v23.3.x] CI Failure (timeout waiting for transform workload) in DataTransformsLoggingTest. data_transforms_test CI Failure (timeout waiting for transform workload) in DataTransformsLoggingTest. data_transforms_test Mar 8, 2024
@rockwotj rockwotj removed the ci-rca/test CI Root Cause Analysis - Test Issue label Mar 8, 2024
@rockwotj
Contributor

In one of the failures I was looking at, the destination topic seemed to get the write, but even a few seconds later those records never show up in the verifier...

TRACE 2024-03-11 21:47:01,266 [shard 0:tran] raft - [group_id:22, {kafka/topic-fokphetxgm/4}] replicate_entries_stm.cc:428 - Replication success, last offset: 115, term: 2

@rockwotj
Contributor

I wonder if we should move produce requests back to the raft/main scheduling group...

@travisdowns travisdowns added the ci-ignore Automatic ci analysis tools ignore this issue label Mar 13, 2024
@travisdowns
Member

Marking as ci-ignore as pandatriage can't handle failures on backport branches.

@travisdowns travisdowns changed the title CI Failure (timeout waiting for transform workload) in DataTransformsLoggingTest. data_transforms_test CI Failure (timeout waiting for transform workload) in DataTransformsLoggingTest.data_transforms_test Mar 14, 2024
@travisdowns
Member

travisdowns commented Mar 14, 2024

https://buildkite.com/redpanda/redpanda/builds/45858#018e1d43-de3d-46f6-bd30-89a5be7a937b

(this one on dev so I can remove ci-ignore)

@travisdowns travisdowns removed the ci-ignore Automatic ci analysis tools ignore this issue label Mar 14, 2024
@vbotbuildovich
Collaborator

@rockwotj
Contributor

More debug-mode failures; going to push on building debug mode with -O1.

@rockwotj rockwotj added the stale label Mar 29, 2024
@rockwotj
Contributor

rockwotj commented Apr 1, 2024

Closing, as we use -O1 now; we'll see if there is another occurrence.

@rockwotj rockwotj closed this as completed Apr 1, 2024
@ztlpn
Contributor

ztlpn commented May 1, 2024

Reopening, as this is still happening (also 23.3.x backport): https://buildkite.com/redpanda/redpanda/builds/48546#018f3202-489b-4585-b8f5-eaa9eec27c43

@piyushredpanda
Contributor

Not seen in at least 6 weeks, closing.


6 participants