(scio-smb) Support mixed FileOperations per BucketedInput #5064

clairemcginty · 2023-11-10T17:44:43Z

Basically:

Instead of containing fields [List<Directory>, FilenameSuffix, FileOperations], BucketedInputs now contain Map<Directory -> [FilenameSuffix, FileOperations]>
BucketMetadata implementations will now implement a "hash" of their primary/secondary key information, used to assess intra-partition compatibility. Optionally, they can also override a new function, Set<Class<? extends BucketMetadata>> compatibleMetadataTypes(), to specify which types of BucketMetadata can be mixed in the same BucketedInput (e.g. Avro+Parquet).
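
To make the compatibility rule above concrete, here is a minimal, self-contained sketch. It does not use the scio-smb classes: `Metadata`, `compatibleFormats()` and `isCompatible()` are hypothetical stand-ins, and the format strings replace the real `Class<? extends BucketMetadata>` values. It only assumes that two partitions can share an input when their key-metadata hashes match and their metadata types are declared compatible.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical, simplified model of the partition-compatibility check. */
final class PartitionCompatibilitySketch {

  /** Stand-in for BucketMetadata; holds only what the check needs. */
  static final class Metadata {
    final String format; // stand-in for the metadata class, e.g. "avro"
    final String primaryKeyField;
    final Class<?> primaryKeyClass;

    Metadata(String format, String primaryKeyField, Class<?> primaryKeyClass) {
      this.format = format;
      this.primaryKeyField = primaryKeyField;
      this.primaryKeyClass = primaryKeyClass;
    }

    /** Hash of the primary-key information, comparable across formats. */
    int hashPrimaryKeyMetadata() {
      return Objects.hash(primaryKeyField, primaryKeyClass.getName());
    }

    /** Formats allowed to share an input with this one (assumed mapping). */
    Set<String> compatibleFormats() {
      if (format.equals("avro") || format.equals("parquet")) {
        return new HashSet<>(Arrays.asList("avro", "parquet"));
      }
      return new HashSet<>(Arrays.asList(format));
    }
  }

  /** True when every partition (directory -> metadata) can be read as one input. */
  static boolean isCompatible(Map<String, Metadata> partitions) {
    Metadata first = null;
    for (Metadata m : partitions.values()) {
      if (first == null) {
        first = m;
        continue;
      }
      boolean typesOk = first.compatibleFormats().contains(m.format);
      boolean keysOk = first.hashPrimaryKeyMetadata() == m.hashPrimaryKeyMetadata();
      if (!typesOk || !keysOk) {
        return false;
      }
    }
    return true;
  }
}
```

Comparing a hash rather than the metadata objects themselves lets an Avro partition and a Parquet partition agree on key information even though their metadata classes differ.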

codecov · 2023-11-10T18:06:00Z

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (9eed9a4) 63.33% compared to head (29a6295) 63.35%.
Report is 5 commits behind head on main.

Files	Patch %	Lines
...sdk/extensions/smb/ParquetTypeSortedBucketIO.scala	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5064      +/-   ##
==========================================
+ Coverage   63.33%   63.35%   +0.02%     
==========================================
  Files         291      291              
  Lines       10837    10843       +6     
  Branches      753      755       +2     
==========================================
+ Hits         6864     6870       +6     
  Misses       3973     3973

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/AvroBucketMetadata.java

clairemcginty · 2023-12-13T21:33:51Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

+      long numDistinctFileOperations =
+          directories.values().stream().map(kv -> kv.getValue().getClass()).distinct().count();
+
+      // If all partitions use the same file operations type, don't keep re-encoding it


In the past, serialized transform size has been an issue (when hundreds or more partitions are being read), but this solution is definitely a bit convoluted 😅

You could potentially add a header to write out the distinct file operations with indices and have a reference to those indices in each of the output objects. Slight increase in overhead if all file ops are different, but it would decrease payload size even more if there are some (but not exactly 1) shared operations

that's a good idea - will give it a try
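
The header idea discussed in this thread can be sketched roughly as follows. This is not the coder from the PR: `IndexedEncodingSketch`, `Encoded`, `encode` and `decode` are hypothetical names, and plain strings stand in for the serialized FileOperations. The sketch only shows the shape of the technique: write each distinct value once, then store a small index per directory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an indexed header encoding. */
final class IndexedEncodingSketch {

  /** Distinct values written once (the header) plus one index per directory. */
  static final class Encoded {
    final List<String> header;
    final Map<String, Integer> directoryToIndex;

    Encoded(List<String> header, Map<String, Integer> directoryToIndex) {
      this.header = header;
      this.directoryToIndex = directoryToIndex;
    }
  }

  /** Deduplicates the file-operations values and replaces each with an index. */
  static Encoded encode(Map<String, String> directoryToFileOps) {
    Map<String, Integer> indexOf = new LinkedHashMap<>();
    Map<String, Integer> refs = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : directoryToFileOps.entrySet()) {
      Integer idx = indexOf.get(e.getValue());
      if (idx == null) {
        // First time this value is seen: it takes the next header slot.
        idx = indexOf.size();
        indexOf.put(e.getValue(), idx);
      }
      refs.put(e.getKey(), idx);
    }
    // LinkedHashMap keeps insertion order, so position i holds the value with index i.
    return new Encoded(new ArrayList<>(indexOf.keySet()), refs);
  }

  /** Rebuilds the original directory -> file-operations mapping. */
  static Map<String, String> decode(Encoded encoded) {
    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : encoded.directoryToIndex.entrySet()) {
      out.put(e.getKey(), encoded.header.get(e.getValue()));
    }
    return out;
  }
}
```

With hundreds of partitions and only a few distinct file operations, the header stays small and each partition costs one integer, which matches the trade-off described in the comment.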

kellen · 2023-12-15T13:09:35Z

How might a user use this change?

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

kellen · 2023-12-15T13:07:09Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

+      long numDistinctFileOperations =
+          directories.values().stream().map(kv -> kv.getValue().getClass()).distinct().count();
+
+      // If all partitions use the same file operations type, don't keep re-encoding it


You could potentially add a header to write out the distinct file operations with indices and have a reference to those indices in each of the output objects. Slight increase in overhead if all file ops are different, but it would decrease payload size even more if there are some (but not exactly 1) shared operations

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/BucketMetadata.java

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/JsonBucketMetadata.java

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketIO.java

clairemcginty · 2023-12-18T14:02:53Z

How might a user use this change?

You can use it if you're instantiating a SortedBucketSource transform directly (i.e. creating your own BucketedInput instances), or if you write a new SortedBucketIO.Read implementation that accepts mixed file types

shnapz · 2023-12-18T16:11:32Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/ParquetAvroFileOperations.java

-    return (AvroCoder<ValueT>) AvroCoder.of(getSchema());
+    return recordClass == null
+        ? (AvroCoder<ValueT>) AvroCoder.of(getSchema())
+        : AvroCoder.of(recordClass, true);


How did this work before? Codec for GenericRecord was used for SpecificRecords?

I introduced a bug in Beam (apache/beam#29518). In the current version it will use SpecificData. Prefer an explicit AvroCoder.reflect

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/ParquetBucketMetadata.java

shnapz · 2023-12-18T16:57:13Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

@@ -427,18 +442,45 @@ public static <V> BucketedInput<V> of(
          tupleTag, inputDirectories, filenameSuffix, fileOperations, predicate);


What if we remove the old constructor signature, as long as interaction is done via the of factory method?

clairemcginty · 2023-12-19T16:15:11Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketIO.java

    public abstract TupleTag<V> getTupleTag();

-    protected abstract BucketedInput<V> toBucketedInput(SortedBucketSource.Keying keying);
+    public abstract BucketedInput<V> toBucketedInput(SortedBucketSource.Keying keying);


We can use this public method to construct SMB taps from the Scala API bindings, which take in SortedBucketIO.Read objects 👍

farzad-sedghi · 2023-12-19T19:38:07Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/BucketMetadata.java


  public abstract K1 extractKeyPrimary(V value);

  public abstract K2 extractKeySecondary(V value);

+  abstract int hashPrimaryKeyMetadata();
+
+  abstract int hashSecondaryKeyMetadata();


The secondary key implementation we have today (lexicographic) is one way of doing it, and in some cases, such as analytics, it is less efficient or less common than something like z-ordering. It might be worth thinking a bit more to come up with a more future-proof API, e.g. one that supports N keys?
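
For readers unfamiliar with the distinction drawn here, the following is an illustrative sketch only and is not part of scio-smb. `KeyOrderingSketch`, `compareLexicographic` and `zOrder` are hypothetical names, and the keys are assumed to be small non-negative integers. It contrasts ordering by primary then secondary key with a z-order (Morton) code that interleaves the bits of both keys.

```java
/** Hypothetical sketch contrasting lexicographic and z-order key ordering. */
final class KeyOrderingSketch {

  /** Lexicographic: compare primary keys first, secondary keys only on ties. */
  static int compareLexicographic(int primary1, int secondary1, int primary2, int secondary2) {
    int c = Integer.compare(primary1, primary2);
    return c != 0 ? c : Integer.compare(secondary1, secondary2);
  }

  /** Z-order (Morton) code: interleaves the low 16 bits of x and y. */
  static long zOrder(int x, int y) {
    long z = 0;
    for (int i = 0; i < 16; i++) {
      z |= ((long) ((x >> i) & 1)) << (2 * i); // bits of x go to even positions
      z |= ((long) ((y >> i) & 1)) << (2 * i + 1); // bits of y go to odd positions
    }
    return z;
  }
}
```

Sorting by the z-order code keeps records that are close in both keys near each other, whereas the lexicographic order only clusters on the secondary key within a single primary key value.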

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/FileOperations.java

RustedBones · 2024-01-05T08:56:47Z

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/BucketMetadataUtil.java

@@ -86,11 +85,11 @@ int leastNumBuckets() {
  }

  private <V> SourceMetadata<V> getSourceMetadata(
-      List<String> directories,
-      String filenameSuffix,
+      Map<ResourceId, KV<String, FileOperations<V>>> directories,


Nit: This is now a bit more than directories. It would probably be nice to change the naming to something more precise, like directoryOperations

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

clairemcginty commented Nov 10, 2023

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java

clairemcginty commented Nov 10, 2023

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/SortedBucketSource.java (outdated)

clairemcginty commented Nov 20, 2023

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/AvroBucketMetadata.java (outdated)

clairemcginty force-pushed the smb-mixed-input branch 2 times, most recently from 9980a60 to a437c55 Compare December 13, 2023 21:10

clairemcginty commented Dec 13, 2023

clairemcginty changed the title from "Support mixed FileOperations per source in SortedBucketSource" to "(scio-smb) Support mixed FileOperations per BucketedInput" on Dec 14, 2023

clairemcginty added 10 commits December 14, 2023 12:15

Support mixed FileOperations per source in SortedBucketSource

9f8072e

cleanup

52bd353

Include filenameSuffix with fileOperations

b072035

fix Coder compat check; fix test

87cf839

Refactor partition-compatibility check to use hashed value

1be2bbe

cleanup

95f75f9

Less wasteful ptransform serialization

563fb43

ParquetAvroFileOperations getCoder should match AvroFileOperations getCoder

00f41da

Test specific + generic reads

d62b24f

fetchMetadata must use a List for deterministic batching

a45419c

clairemcginty force-pushed the smb-mixed-input branch from a11633b to a45419c Compare December 14, 2023 17:16

clairemcginty marked this pull request as ready for review December 14, 2023 17:19

clairemcginty added 3 commits December 14, 2023 13:01

Make SortedBucketIO.Read extendable for multi-format

2825981

Fixup SortedBucketIO.Read signature

a4e2193

Make getInputDirectories() public for extensions

1925d7e

kellen reviewed Dec 15, 2023

RustedBones reviewed Dec 15, 2023

PR comments

875ae0e

shnapz reviewed Dec 18, 2023

scio-smb/src/main/java/org/apache/beam/sdk/extensions/smb/ParquetBucketMetadata.java

Encode BucketedInput as indexed map of FileOperations/FileSuffixes

ab1ec15

shnapz reviewed Dec 18, 2023

Add BucketedInput#getInputs to unlock SMB taps

25dbe46

clairemcginty commented Dec 19, 2023

clairemcginty added 2 commits December 19, 2023 11:15

Use AvroCoder.reflect

0350664

Use local variable

ae11f9a

farzad-sedghi reviewed Dec 19, 2023

clairemcginty added this to the 0.14.0 milestone Jan 4, 2024

RustedBones reviewed Jan 5, 2024

clairemcginty added 3 commits January 5, 2024 08:59

Use SerializableCoder

8f39bd1

cleanup

4271462

move coders to static variables

29a6295

RustedBones approved these changes Jan 8, 2024

clairemcginty merged commit f77249a into main Jan 8, 2024
11 checks passed

clairemcginty deleted the smb-mixed-input branch January 8, 2024 14:41

clairemcginty mentioned this pull request Jan 18, 2024

Remove abstract getFileOperations from SortedBucketIO.Read #5182

Merged
