Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(scio-smb) Support mixed FileOperations per BucketedInput #5064

Merged
merged 21 commits into from
Jan 8, 2024

Conversation

clairemcginty
Copy link
Contributor

@clairemcginty clairemcginty commented Nov 10, 2023

Basically:

  • Instead of containing fields [List<Directory>, FilenameSuffix, FileOperations], BucketInputs now contain Map<Directory -> [FilenameSuffix, FileOperations]>
  • BucketMetadata implementations will now implement a "hash" of their primary/secondary key information, used to assess intra-partition compatibility. Optionally, they can also override a new function, Set<Class<? extends BucketMetadata>> compatibleMetadataTypes(), to specify which types of BucketMetadatas can be mixed in the same BucketedInput (i.e. Avro+Parquet).

Copy link

codecov bot commented Nov 10, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (9eed9a4) 63.33% compared to head (29a6295) 63.35%.
Report is 5 commits behind head on main.

Files Patch % Lines
...sdk/extensions/smb/ParquetTypeSortedBucketIO.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5064      +/-   ##
==========================================
+ Coverage   63.33%   63.35%   +0.02%     
==========================================
  Files         291      291              
  Lines       10837    10843       +6     
  Branches      753      755       +2     
==========================================
+ Hits         6864     6870       +6     
  Misses       3973     3973              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@clairemcginty clairemcginty force-pushed the smb-mixed-input branch 2 times, most recently from 9980a60 to a437c55 Compare December 13, 2023 21:10
long numDistinctFileOperations =
directories.values().stream().map(kv -> kv.getValue().getClass()).distinct().count();

// If all partitions use the same file operations type, don't keep re-encoding it
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past serialized transform size has been an issue (if hundreds+ of partitions are being read), but this solution is definitely a bit convoluted 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could potentially add a header to write out the distinct file operations with indicies and have a reference to those indicies in each of the output objects. Slight increase in overhead if all file ops are different, but would decrease payload size even more if there are some (but not exactly 1) shared operations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good idea - will give it a try

@clairemcginty clairemcginty changed the title Support mixed FileOperations per source in SortedBucketSource (scio-smb) Support mixed FileOperations per BucketedInput Dec 14, 2023
@clairemcginty clairemcginty marked this pull request as ready for review December 14, 2023 17:19
@kellen
Copy link
Contributor

kellen commented Dec 15, 2023

How might a user use this change?

long numDistinctFileOperations =
directories.values().stream().map(kv -> kv.getValue().getClass()).distinct().count();

// If all partitions use the same file operations type, don't keep re-encoding it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could potentially add a header to write out the distinct file operations with indicies and have a reference to those indicies in each of the output objects. Slight increase in overhead if all file ops are different, but would decrease payload size even more if there are some (but not exactly 1) shared operations

@clairemcginty
Copy link
Contributor Author

How might a user use this change?

You can use it if you're instantiating a SortedBucketSource transform directly (i.e. creating your own BucketedInput instances), or extend a new SortedBucketIO.Read implementation that accepts mixed file types

return (AvroCoder<ValueT>) AvroCoder.of(getSchema());
return recordClass == null
? (AvroCoder<ValueT>) AvroCoder.of(getSchema())
: AvroCoder.of(recordClass, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did this work before? Codec for GenericRecord was used for SpecificRecords?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a bug in beam apache/beam#29518. In current version it will use SpecificData. Prefer explicit AvroCoder.reflect

@@ -427,18 +442,45 @@ public static <V> BucketedInput<V> of(
tupleTag, inputDirectories, filenameSuffix, fileOperations, predicate);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we remove old constructor signature, as long as interaction is done via of factory method?

public abstract TupleTag<V> getTupleTag();

protected abstract BucketedInput<V> toBucketedInput(SortedBucketSource.Keying keying);
public abstract BucketedInput<V> toBucketedInput(SortedBucketSource.Keying keying);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use this public method to construct SMB taps from the Scala API bindings, which take in SortedBucketIO.Read objects 👍


public abstract K1 extractKeyPrimary(V value);

public abstract K2 extractKeySecondary(V value);

abstract int hashPrimaryKeyMetadata();

abstract int hashSecondaryKeyMetadata();
Copy link
Contributor

@farzad-sedghi farzad-sedghi Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the secondary key impl. (lexicographic) we have today is one way of implementing it, which in some cases like analytics is less efficient/common as opposed to sth like z-ordering. Might be nice to think a bit more to find out with a more future proof API e.g. support N keys?

@clairemcginty clairemcginty added this to the 0.14.0 milestone Jan 4, 2024
@@ -86,11 +85,11 @@ int leastNumBuckets() {
}

private <V> SourceMetadata<V> getSourceMetadata(
List<String> directories,
String filenameSuffix,
Map<ResourceId, KV<String, FileOperations<V>>> directories,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This is now a bit more than direcories. Would probably be nice to change naming for smth more precise like directoryOperations

@clairemcginty clairemcginty merged commit f77249a into main Jan 8, 2024
11 checks passed
@clairemcginty clairemcginty deleted the smb-mixed-input branch January 8, 2024 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants