[BEAM-14546] Fix errant pass for empty collections in Count #17813

jrmccluskey · 2022-06-02T14:37:05Z

Adds a NonEmpty() passert utility that checks that a given PCollection has at least one element. Count() now uses it whenever the expected number of elements is non-zero, so an empty collection can no longer produce an erroneously passing test.

This uncovered, and fixes, an instance of erroneous passing in the synthetic step tests. The same fix is applied to passert.Sum.
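
A rough illustration of the failure mode described above, written in plain Go without the Beam SDK. The names `combineCount`, `assertCount`, `assertNonEmpty`, and `guardedCount` are hypothetical stand-ins, and the sketch assumes that the counting step emits no value at all for an empty input, so the comparison never runs:

```go
package main

import (
	"errors"
	"fmt"
)

// combineCount is a hypothetical stand-in for a global counting step.
// Assumption: for empty input it emits nothing, rather than a single 0.
func combineCount(elems []string) []int {
	if len(elems) == 0 {
		return nil
	}
	return []int{len(elems)}
}

// assertCount compares every emitted count against want. When no count was
// emitted the loop body never runs, so the check passes vacuously.
func assertCount(counts []int, want int) error {
	for _, got := range counts {
		if got != want {
			return fmt.Errorf("count = %d, want %d", got, want)
		}
	}
	return nil
}

// assertNonEmpty models the added guard: it fails when there are no elements.
func assertNonEmpty(elems []string) error {
	if len(elems) == 0 {
		return errors.New("PCollection is empty, want non-empty collection")
	}
	return nil
}

// guardedCount applies the guard only when a non-zero count is expected,
// so that expecting zero elements from an empty input still passes.
func guardedCount(elems []string, want int) error {
	if want > 0 {
		if err := assertNonEmpty(elems); err != nil {
			return err
		}
	}
	return assertCount(combineCount(elems), want)
}

func main() {
	var empty []string
	fmt.Println("unguarded:", assertCount(combineCount(empty), 3))
	fmt.Println("guarded:", guardedCount(empty, 3))
}
```

In this model the unguarded check returns no error for an empty input even though three elements were expected, while the guarded version reports the empty collection.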


codecov · 2022-06-02T14:40:11Z

Codecov Report

Merging #17813 (b3955fc) into master (999bcea) will increase coverage by 0.02%.
The diff coverage is 86.20%.

@@            Coverage Diff             @@
##           master   #17813      +/-   ##
==========================================
+ Coverage   74.09%   74.11%   +0.02%     
==========================================
  Files         697      697              
  Lines       91980    92036      +56     
==========================================
+ Hits        68148    68212      +64     
+ Misses      22583    22574       -9     
- Partials     1249     1250       +1

Flag	Coverage Δ
go	`50.95% <86.20%> (+0.13%)`	⬆️

Impacted Files	Coverage Δ
sdks/go/pkg/beam/io/synthetic/step.go	`81.81% <71.42%> (ø)`
sdks/go/pkg/beam/testing/passert/passert.shims.go	`62.02% <82.35%> (+4.79%)`	⬆️
sdks/go/pkg/beam/testing/passert/count.go	`79.16% <100.00%> (+2.97%)`	⬆️
sdks/go/pkg/beam/testing/passert/hash.go	`28.00% <100.00%> (+28.00%)`	⬆️
sdks/go/pkg/beam/testing/passert/passert.go	`81.11% <100.00%> (+2.36%)`	⬆️
sdks/go/pkg/beam/testing/passert/sum.go	`100.00% <100.00%> (ø)`
sdks/go/pkg/beam/core/runtime/exec/fn.go	`69.55% <0.00%> (ø)`
sdks/go/pkg/beam/core/runtime/exec/sdf.go	`70.84% <0.00%> (+0.23%)`	⬆️
sdks/go/pkg/beam/core/runtime/exec/pardo.go	`59.43% <0.00%> (+0.32%)`	⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

asf-ci · 2022-06-02T14:53:13Z

Can one of the admins verify this patch?

asf-ci · 2022-06-02T14:54:16Z

Can one of the admins verify this patch?

github-actions · 2022-06-02T15:04:39Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label go.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

damccorm · 2022-06-02T15:21:58Z

sdks/go/pkg/beam/testing/passert/count_test.go

+			if err := ptest.Run(p); err != nil {
+				t.Errorf("Pipeline failed: %v", err)
+			}
+		})
 	}
 }

 func TestCount_Bad(t *testing.T) {


Can we simplify and combine these tests into one by adding an expectErr variable to the tests struct?

We could, although I don't see too much value in bundling them into one suite. The setup isn't particularly long or complicated to test the function, so deduplicating it doesn't add much value IMO. Totally cool doing it if you feel strongly though.

I'd argue it's worth doing since more/duplicated code => more opportunities for bugs to slip in when updates are needed and more for a future developer (maybe us) to understand (code is a liability). I'm not going to block on it though, it's not very important
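
For reference, the suggestion above (one table with an expectErr field instead of separate good and bad tests) could look roughly like the following plain-Go sketch. `checkCount`, `countCase`, `cases`, and `runCases` are hypothetical names; `checkCount` merely stands in for building and running a pipeline that contains the count assertion:

```go
package main

import "fmt"

// checkCount is a hypothetical stand-in for running a pipeline that
// asserts on the number of elements; it returns an error on a mismatch.
func checkCount(elems []string, want int) error {
	if len(elems) != want {
		return fmt.Errorf("count = %d, want %d", len(elems), want)
	}
	return nil
}

// countCase holds one table entry. expectErr marks cases where the
// assertion is supposed to fail.
type countCase struct {
	name      string
	elems     []string
	want      int
	expectErr bool
}

// cases returns passing and failing scenarios in a single table.
func cases() []countCase {
	return []countCase{
		{name: "matching", elems: []string{"a", "b"}, want: 2},
		{name: "empty, want zero", elems: nil, want: 0},
		{name: "mismatch", elems: []string{"a"}, want: 2, expectErr: true},
		{name: "empty, want non-zero", elems: nil, want: 1, expectErr: true},
	}
}

// runCases returns the names of cases whose outcome differs from expectErr.
func runCases(cs []countCase) []string {
	var failed []string
	for _, tc := range cs {
		err := checkCount(tc.elems, tc.want)
		if (err != nil) != tc.expectErr {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("unexpected outcomes:", runCases(cases()))
}
```

In an actual test file the loop body would sit inside t.Run and report through t.Errorf; the comparison `(err != nil) != tc.expectErr` is the part that lets both kinds of case share one loop.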

damccorm · 2022-06-02T15:24:48Z

sdks/go/pkg/beam/testing/passert/count.go

@@ -30,6 +30,10 @@ func Count(s beam.Scope, col beam.PCollection, name string, count int) {
 	if typex.IsKV(col.Type()) {
 		col = beam.DropKey(s, col)
 	}
+
+	if count > 0 {
+		NonEmpty(s, col)


Do we need to add the same thing to Hash and Sum? I think an empty PCollection would silently pass for both of those as well

Sum did need it, confirmed via unit test. Will add some short validation for Hash

Added a small "fail on empty" check for Hash.

damccorm · 2022-06-02T15:31:38Z

The portable failure looks legit as well:

11:01:07 2022/06/02 14:58:41 2022-06-02 14:58:41. (26): Traceback (most recent call last):
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/local_job_service.py", line 275, in _run_job
11:01:07     self.result = self._invoke_runner()
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/local_job_service.py", line 297, in _invoke_runner
11:01:07     return fn_runner.FnApiRunner(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 208, in run_via_runner_api
11:01:07     return self.run_stages(stage_context, stages)
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 408, in run_stages
11:01:07     bundle_results = self._execute_bundle(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 736, in _execute_bundle
11:01:07     self._run_bundle(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 965, in _run_bundle
11:01:07     result, splits = bundle_manager.process_bundle(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1314, in process_bundle
11:01:07     raise RuntimeError(result.error)
11:01:07 RuntimeError: process bundle failed for instruction bundle_15 using plan 2 : while executing Process for Plan[2]:
11:01:07 2: ParDo[passert.nonEmptyFn] Out:[]
11:01:07 1: DataSource[S[passert.Count(out)/passert.NonEmpty/Impulse@localhost:38365], i0] Coder:W;c0_windowed<bytes;c0>!GWC Out:2
11:01:07 	caused by:
11:01:07 DoFn[UID:2, PID:passert.Count(out)/passert.NonEmpty/passert.nonEmptyFn, Name: github.com/apache/beam/sdks/v2/go/pkg/beam/testing/passert.nonEmptyFn] failed:
11:01:07 PCollection is empty, want non-empty collection
11:01:07 2022/06/02 14:58:41 2022-06-02 14:58:41. (27): Error running pipeline.
11:01:07 Traceback (most recent call last):
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/local_job_service.py", line 275, in _run_job
11:01:07     self.result = self._invoke_runner()
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/local_job_service.py", line 297, in _invoke_runner
11:01:07     return fn_runner.FnApiRunner(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 208, in run_via_runner_api
11:01:07     return self.run_stages(stage_context, stages)
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 408, in run_stages
11:01:07     bundle_results = self._execute_bundle(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 736, in _execute_bundle
11:01:07     self._run_bundle(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 965, in _run_bundle
11:01:07     result, splits = bundle_manager.process_bundle(
11:01:07   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_GoPortable_Commit/src/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1314, in process_bundle
11:01:07     raise RuntimeError(result.error)
11:01:07 RuntimeError: process bundle failed for instruction bundle_15 using plan 2 : while executing Process for Plan[2]:
11:01:07 2: ParDo[passert.nonEmptyFn] Out:[]
11:01:07 1: DataSource[S[passert.Count(out)/passert.NonEmpty/Impulse@localhost:38365], i0] Coder:W;c0_windowed<bytes;c0>!GWC Out:2
11:01:07 	caused by:
11:01:07 DoFn[UID:2, PID:passert.Count(out)/passert.NonEmpty/passert.nonEmptyFn, Name: github.com/apache/beam/sdks/v2/go/pkg/beam/testing/passert.nonEmptyFn] failed:
11:01:07 PCollection is empty, want non-empty collection
11:01:07 2022/06/02 14:58:41 Job state: FAILED
11:01:07     ptest.go:108: Failed to execute job: job go-testsimplepipeline-e0c742f9-4ac7-4f8f-9eff-73c54d6d8617 failed
11:01:07 --- FAIL: TestSimplePipeline (12.70s)

Not sure if the test is broken (and erroneously passing) or if something here is broken

jrmccluskey · 2022-06-02T16:12:14Z

It's strange. From what I can tell, the problem lies in the synthetic code, since Count() is used heavily in a number of places and both the unit tests and the other integration tests are passing. Let me see whether this is reproducible on other runners

jrmccluskey · 2022-06-02T16:12:27Z

Run Go Flink ValidatesRunner

jrmccluskey · 2022-06-02T16:12:45Z

Run Go PostCommit

jrmccluskey · 2022-06-02T17:04:22Z

Found the problem: the synthetic StepCfg struct was not exported, so the number of elements to emit in the synthetic tests was being set to 0, producing empty PCollections
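
A sketch of how an unexported field can silently turn into a zero value. It assumes the DoFn struct is serialized with a JSON-style encoder that skips unexported fields; that is an assumption about the mechanism, not something stated in this thread. `stepConfig`, `unexportedFn`, `exportedFn`, and `roundTrip` are hypothetical names, not the synthetic package's types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stepConfig is a hypothetical stand-in for the synthetic step configuration.
type stepConfig struct {
	OutputPerInput int
}

// unexportedFn keeps its configuration in an unexported field, which
// encoding/json ignores when marshaling.
type unexportedFn struct {
	cfg stepConfig
}

// exportedFn keeps its configuration in an exported field, which
// encoding/json includes when marshaling.
type exportedFn struct {
	Cfg stepConfig
}

// roundTrip marshals in to JSON and unmarshals the result into out,
// imitating a value being shipped to a worker and decoded there.
func roundTrip(in, out interface{}) error {
	data, err := json.Marshal(in)
	if err != nil {
		return err
	}
	return json.Unmarshal(data, out)
}

func main() {
	var a unexportedFn
	if err := roundTrip(unexportedFn{cfg: stepConfig{OutputPerInput: 5}}, &a); err != nil {
		panic(err)
	}
	var b exportedFn
	if err := roundTrip(exportedFn{Cfg: stepConfig{OutputPerInput: 5}}, &b); err != nil {
		panic(err)
	}
	fmt.Println("unexported field after round trip:", a.cfg.OutputPerInput)
	fmt.Println("exported field after round trip:", b.Cfg.OutputPerInput)
}
```

Under that assumption, a step configured to emit several elements per input would arrive with a zero count after the round trip, which matches the empty PCollections seen in the synthetic tests.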

jrmccluskey · 2022-06-02T17:38:01Z

Run Go Flink ValidatesRunner

damccorm · 2022-06-02T17:41:25Z

I'd argue it's worth doing since more/duplicated code => more opportunities for bugs to slip in when updates are needed and more for a future developer (maybe us) to understand (code is a liability). I'm not going to block on it though, it's not very important

github-actions · 2022-06-02T17:50:51Z

R: @youngoli for final approval

jrmccluskey · 2022-06-02T18:25:07Z

R: @lostluck since they have context

github-actions · 2022-06-02T18:27:13Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

lostluck

LGTM Thanks!

jrmccluskey · 2022-06-03T13:51:40Z

Run GoPortable PreCommit

[BEAM-14546] Fix errant pass for empty collections in Count

d8d897e

github-actions bot added the go label Jun 2, 2022

Formatting

9e986e7

github-actions bot added the Next Action: Reviewers label Jun 2, 2022

damccorm reviewed Jun 2, 2022

Export config field for synthetic step DoFns

5f685f1

github-actions bot added io and removed io labels Jun 2, 2022

jrmccluskey added 2 commits June 2, 2022 13:00

Add NonEmpty check to sum

325dd8b

Add NonEmpty check to Hash

54032c3

Add empty collection hash test

b3955fc

damccorm approved these changes Jun 2, 2022

github-actions bot added io and removed io labels Jun 2, 2022

lostluck approved these changes Jun 2, 2022

lostluck merged commit c072c36 into apache:master Jun 3, 2022

jrmccluskey deleted the theCount branch June 15, 2022 18:33
