Add helper methods for Parquet ReadFiles transforms #4801
Conversation
@@ -43,8 +58,92 @@ object ParquetRead {
      readSupportFactory: ReadSupportFactory[T],
      conf: SerializableConfiguration,
      projectionFn: SerializableFunction[T, R]
-  ): ParDo.SingleOutput[ReadableFile, R] = {
+  ): PTransform[PCollection[ReadableFile], PCollection[R]] = {
I think this will technically break MiMa :( but on the bright side, I think this method has basically zero adoption, so it should be ok. The reason is that ParDo.of(..) constructs a PTransform[PCollection[_ <: ReadableFile], ...]. Thus, the pre-existing signature didn't work when used with SCollection#readFiles, as the signatures didn't match.
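The variance mismatch being described can be sketched with stand-in types (PCollection, PTransform, and ReadableFile below are simplified hypothetical stand-ins, not the real Beam classes): ParDo.of returns a transform whose input type is widened to a wildcard, and since PTransform is invariant, that is not the exact PTransform[PCollection[ReadableFile], PCollection[R]] that SCollection#readFiles expects. Wrapping the ParDo in an explicitly-typed PTransform pins the signature:

```scala
// Simplified stand-ins for the Beam types, just to show the variance issue.
class PCollection[T](val items: List[T])
abstract class PTransform[I, O] { def expand(in: I): O }
trait ReadableFile { def path: String }

// Mirrors ParDo.of: the input type is widened to a wildcard.
def parDoOf[R](fn: ReadableFile => R): PTransform[PCollection[_ <: ReadableFile], PCollection[R]] =
  new PTransform[PCollection[_ <: ReadableFile], PCollection[R]] {
    def expand(in: PCollection[_ <: ReadableFile]): PCollection[R] =
      new PCollection(in.items.map(fn))
  }

// A readFiles-style call site wants the exact type; because PTransform is
// invariant, passing parDoOf's result directly would not compile.
// Wrapping it exposes the exact expected signature:
def readFilesTransform[R](fn: ReadableFile => R): PTransform[PCollection[ReadableFile], PCollection[R]] =
  new PTransform[PCollection[ReadableFile], PCollection[R]] {
    def expand(in: PCollection[ReadableFile]): PCollection[R] = parDoOf(fn).expand(in)
  }

val in  = new PCollection[ReadableFile](List(new ReadableFile { def path = "a.parquet" }))
val out = readFilesTransform[String](_.path).expand(in)
// out.items == List("a.parquet")
```

This is only a sketch of the type-level problem; the real fix in the PR changes the declared return type of the helper.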
Maybe a candidate for 0.13?
Codecov Report
@@ Coverage Diff @@
## main #4801 +/- ##
==========================================
+ Coverage 62.84% 62.96% +0.11%
==========================================
Files 282 282
Lines 10573 10641 +68
Branches 795 779 -16
==========================================
+ Hits 6645 6700 +55
- Misses 3928 3941 +13
on second thought, maybe I can move these methods into
@@ -43,8 +58,92 @@ object ParquetRead {
      readSupportFactory: ReadSupportFactory[T],
      conf: SerializableConfiguration,
      projectionFn: SerializableFunction[T, R]
-  ): ParDo.SingleOutput[ReadableFile, R] = {
+  ): PTransform[PCollection[ReadableFile], PCollection[R]] = {
+    val sdf = new ParquetReadFn[T, R](readSupportFactory, conf, projectionFn)
So the readFiles API is using only splittable reads, right? Does it perform the same as the legacy read on runner v1?
Yeah, I think it's a little less performant on RunnerV1 (doesn't scale up as well).
scio-parquet/src/main/scala/com/spotify/scio/parquet/read/ParquetRead.scala
  )

  def readTyped[T: ClassTag: Coder: ParquetType, R](
    projectionFn: T => R,
Why 2 APIs instead of setting identity as the default projectionFn?
Just to make the API friendlier for users who aren't supplying a projection, so they don't have to specify their T type twice.
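The call-site difference can be sketched in plain Scala (readTyped below is a hypothetical stand-in over Lists, not the real transform): with a single two-type-parameter method, non-projecting callers must write readTyped[MyRecord, MyRecord](..., identity), while a separate single-type-parameter overload lets inference do the work:

```scala
// Hypothetical stand-ins over Lists, illustrating the two-API choice.
def readTyped[T, R](input: List[T], projectionFn: T => R): List[R] =
  input.map(projectionFn)

// Convenience overload fixing R = T, so callers never name T twice:
def readTyped[T](input: List[T]): List[T] = readTyped[T, T](input, identity)

val records     = List("a", "bb", "ccc")
val projected   = readTyped(records, (s: String) => s.length) // List(1, 2, 3)
val unprojected = readTyped(records)                          // T inferred once, no identity spelled out
```

With only the two-parameter method, `readTyped(records)` would have nothing to infer R from, which is the ergonomic point made above.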
scio-parquet/src/main/scala/com/spotify/scio/parquet/read/ParquetRead.scala
  def readAvroGenericRecordFiles[T](
    schema: Schema,
    projectionFn: GenericRecord => T,
This does not look like a projection: we are reading the whole schema's data even if the function selects a single field of the GenericRecord.
    Option(predicate).foreach(p => ParquetInputFormat.setFilterPredicate(configuration, p))

    val cleanedFn = Functions.serializableFn(ClosureCleaner.clean(projectionFn))
    readFiles(ReadSupportFactory.typed[T], new SerializableConfiguration(configuration), cleanedFn)
I'm not really sure the projectionFn is doing what it is supposed to. I checked the ParquetReadFn here, and it seems to me it is the equivalent of a chained map step.
IMHO we should either:
- simplify the API by removing the projection and let the user do a map post-read (I am also surprised that we do not have an implicit Coder[R] for the output type here), or
- ask the user to create a model T which is a projection of the model used when writing the parquet file.
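The "chained map" equivalence described here can be sketched with stand-in data (plain Lists instead of PCollections, and a hypothetical readFiles helper): in terms of the values produced, a built-in projectionFn and a map after the read are indistinguishable; the rest of the thread turns on whether running the function before Coder#encode changes that.

```scala
// Stand-in: records grouped per "file", read by flattening.
val fileContents = List(List(1, 2), List(3, 4))

// Hypothetical read with a built-in projection step.
def readFiles[R](projectionFn: Int => R): List[R] =
  fileContents.flatten.map(projectionFn)

val withProjection = readFiles(_ * 10)                    // projection inside the read
val withChainedMap = readFiles(identity[Int]).map(_ * 10) // plain read, then map
// Both produce List(10, 20, 30, 40): value-wise, the projection is just a fused map.
```

What this sketch cannot show is the serialization boundary: in a real pipeline the projected values, not the raw records, are what get encoded between stages.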
Re: implicit Coder[R] -- that would be required when the ReadFiles transform is actually applied, here: https://github.com/spotify/scio/blob/main/scio-core/src/main/scala/com/spotify/scio/values/SCollection.scala#L1396
(Maybe it makes sense to move these methods to the respective avro/typed SCollectionSyntax traits, and have the method actually apply the transform to an SCollection? That way we could set the Coder...)
Will circle back on the projectionFn; I want to check something...
OK, so for the projection -- the main advantage of offering the projection, as I understand it, is that it happens pre-Coder#encode.
I think that for GenericRecords, this doesn't offer any advantage, as long as the user correctly specifies an implicit Coder using the projected schema -- see the unit test.
But for SpecificRecords, say you have a schema with two non-nullable fields, but you only want to project one of those fields. Without being able to apply a projectionFn pre-Coder#encode, serialization will fail due to the absence of a non-nullable field. I just updated the unit test to use Account as a test record, which has non-nullable fields; the SpecificRecord test with no projectionFn is currently failing.
So overall, I think that:
a. ProjectionFn support for GenericRecord reads isn't important and could be removed; however, I'm worried that users will be confused about setting the correct Coder here.
b. ProjectionFn support should be a requirement for SpecificRecord reads if your projection excludes non-nullable, non-default fields.
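The SpecificRecord failure mode described above can be sketched without Avro (Account, encode, and the null check below are simplified hypothetical stand-ins for a SpecificRecord and its Coder): a column projection leaves unprojected fields null, so encoding the partial record trips the non-nullable check, while applying the projectionFn before encoding sidesteps it:

```scala
// Stand-in for a SpecificRecord whose "name" field is non-nullable in the schema.
final case class Account(id: Long, name: String)

// Stand-in for Coder#encode: mimics Avro's null check on a required field.
def encode(a: Account): Array[Byte] = {
  require(a.name != null, "non-nullable field 'name' is absent")
  s"${a.id}:${a.name}".getBytes("UTF-8")
}

// A column projection that only reads "id" leaves "name" null:
val projectedRead = Account(1L, name = null)

// Encoding the raw projected record fails the non-nullable check...
val encodeFails = scala.util.Try(encode(projectedRead)).isFailure // true

// ...but running the projectionFn pre-encode means only the projected value
// ever reaches a coder:
val projectionFn: Account => Long = _.id
val projectedId = projectionFn(projectedRead) // 1L, no partial record encoded
```

This is why point (b) above treats projectionFn support as a requirement when the projection drops non-nullable, non-default fields.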
    projection: Schema,
    projectionFn: T => R,
I don't really like those 2 being separated, as they are strongly coupled.
If we want to offer projection, I think we'd better rework the Projection helper so it encapsulates both the mapping function and the schema.
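One possible shape for such a reworked helper (hypothetical sketch; the String schema stands in for the real Avro Schema, and the List-based reader for the real transform): a single Projection value carries both the projected schema and the mapping function, so a call site can't supply one without the other:

```scala
// Hypothetical: couples the projected schema with its mapping function.
final case class Projection[T, R](schema: String, fn: T => R)

final case class FullRecord(id: Long, name: String)

val idOnly = Projection[FullRecord, Long](schema = """{"fields": ["id"]}""", fn = _.id)

// Stand-in reader: a real implementation would also push p.schema into the
// Parquet read configuration so only the projected columns are read.
def readWithProjection[T, R](input: List[T], p: Projection[T, R]): List[R] =
  input.map(p.fn)

val ids = readWithProjection(List(FullRecord(1L, "a"), FullRecord(2L, "b")), idOnly)
// ids == List(1L, 2L)
```

The design point is simply that the schema and the function can no longer drift apart, since they travel in one value.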
I'm happy to rework the Parquet-Avro ReadParam to make this clearer! I think it depends a bit on whether we'd like to include this change in a 0.12 release (I can revert the readFiles signature changes to make MiMa happy), in which case it should be left as-is, or whether we'd like to save it for 0.13.
0049ec2 to 2a4fa63
    predicate: FilterPredicate,
    conf: Configuration
Should we create method overloads to avoid setting those to null?
I've been thinking about that -- my concern is that overloading might make things more confusing from the user's POV. Say we have:

def readAvro[T <: SpecificRecord: ClassTag](): PTransform[PCollection[ReadableFile], PCollection[T]] =
  readAvro[T](null, null, null)

def readAvro[T <: SpecificRecord: ClassTag](
  projection: Schema,
  predicate: FilterPredicate,
  conf: Configuration
): PTransform[PCollection[ReadableFile], PCollection[T]] = ...

In this case, users can either pass 0 arguments or all 3. But I think it's a somewhat common use case to pass in only a projection, or only a predicate -- and that would be more difficult for the user with the overloaded API.
I meant we can create overloads for all possibilities.
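A sketch of what such a full overload set could look like (Schema and Predicate are stand-in types, the String result stands in for the real transform, and the Configuration parameter is omitted for brevity): every overload delegates to the full-arity method, so users never pass null themselves:

```scala
// Stand-in types for the real Schema / FilterPredicate parameters.
final case class Schema(fields: String)
final case class Predicate(expr: String)

// Full-arity method: the only place that handles absent (null) arguments.
def readAvro(projection: Schema, predicate: Predicate): String = {
  val p = Option(projection).fold("*")(_.fields)
  val f = Option(predicate).fold("true")(_.expr)
  s"proj=$p pred=$f"
}

// One delegating overload per combination, so nulls never appear at call sites:
def readAvro(): String                     = readAvro(null, null)
def readAvro(projection: Schema): String   = readAvro(projection, null)
def readAvro(predicate: Predicate): String = readAvro(null, predicate)
```

With this shape, the "only a projection" and "only a predicate" call sites raised earlier each get their own overload instead of a null-padded call.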
👍 sounds good. thanks for updating the PR!
    predicate: FilterPredicate,
    conf: Configuration
Same here?
    predicate: FilterPredicate,
    conf: Configuration
and here
Adds helper methods to create Parquet ReadFiles transforms.
Motivation: I realized it was really hard for users to construct a correct Parquet ReadFiles for Avro, because the required configuration is so specific. So I added helpers, designed to be used in a similar way as Beam's AvroIO.readFiles.
Feedback on the method naming/placement is welcome -- I guess ParquetRead.readAvroGenericRecordFiles is kind of redundant?