Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize BucketMetadata#HashType as String? #5193

Closed
clairemcginty opened this issue Jan 22, 2024 · 0 comments
Closed

Serialize BucketMetadata#HashType as String? #5193

clairemcginty opened this issue Jan 22, 2024 · 0 comments

Comments

@clairemcginty
Copy link
Contributor

clairemcginty commented Jan 22, 2024

Currently, BucketMetadata is written to JSON including the HashFunction enum value of its HashType, i.e.:

{
  "numBuckets": 2,
  "numShards": 1,
  "keyClass": "java.lang.String",
  "hashType": "MURMUR3_32",
  "keyField":" user_id"
}

However, this requires that all SMB reads have the writer's HashFunction type available on the classpath in order to deserialize metadata.json back into a BucketMetadata. As a result, if we ever evolve the available HashFunctions and an SMB writer wants to use a new one, all of its consumers will have to upgrade to the latest Scio to be able to deserialize its BucketMetadata.

The reader only needs to know about HashFunction in order to assert that it's the same across all partitions/sources -- and we could use a String for that purpose. Should we future-proof the SMB API by serializing HashFunction explicitly as a String?

i.e.

public abstract class BucketMetadata<K1, K2, V> implements Serializable, HasDisplayData {
-   @JsonProperty private final HashType hashType;
+   @JsonProperty private final String hashType;

+ // Used by writer code
  public HashType getHashType() {
-   return hashType;
+   return HashType.valueOf(hashType);
  }

+ // Works mostly as-is
  boolean isCompatibleWith(BucketMetadata other) {
    return other != null
        // version 1 is backwards compatible with version 0
        && (this.version <= 1 && other.version <= 1)
-        && this.hashType == other.hashType
+        && this.hashType.equals(other.hashType)
        // This check should be redundant since power of two is checked in BucketMetadata
        // constructor, but it's cheap to double-check.
        && (Math.max(numBuckets, other.numBuckets) % Math.min(numBuckets, other.numBuckets) == 0);
  }
}

Since HashType is already written using its String representation there should be no cross-version compatibility issues associated with this change -- you'd still be able to read old metadata.json files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant