Add prefix & suffix param for all IO APIs #4809
ScioIO
}

final case class WriteParam(
final case class WriteParam private (
Are you sure nobody is calling it directly?
This only restricts the `new` constructor; `apply` stays public. I think it is for API evolution, where we would only have to overload `apply` in the companion object.
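A minimal sketch of the evolution pattern described above, not the actual Scio code. The three-field `WriteParam` and the two-argument overload are hypothetical and simplified. The sketch relies only on an explicitly defined `apply`, so it does not depend on the visibility of the compiler-generated one.

```scala
object WriteParamSketch {
  // Hypothetical, simplified WriteParam: the real class has more fields.
  // The private modifier blocks `new WriteParam(...)` outside the companion.
  final case class WriteParam private (numShards: Int, prefix: String, suffix: String)

  object WriteParam {
    // Overload for callers written against an older, shorter signature.
    // When a field is added, only overloads like this one need to be maintained.
    def apply(numShards: Int, suffix: String): WriteParam =
      new WriteParam(numShards, null, suffix)
  }
}
```

Callers go through the companion, for example `WriteParamSketch.WriteParam(4, ".avro")`, so adding a field later does not break them as long as the old overload is kept.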
Codecov Report
@@ Coverage Diff @@
## main #4809 +/- ##
==========================================
+ Coverage 62.54% 62.72% +0.18%
==========================================
Files 281 281
Lines 10431 10539 +108
Branches 781 775 -6
==========================================
+ Hits 6524 6611 +87
- Misses 3907 3928 +21
scio-avro/src/main/scala/com/spotify/scio/avro/syntax/ScioContextSyntax.scala
@@ -93,8 +91,22 @@ private[scio] object ScioUtil {

private def stripPath(path: String): String = StringUtils.stripEnd(path, "/")
def strippedPath(path: String): String = s"${stripPath(path)}/"
def pathWithPrefix(path: String, filePrefix: String): String = s"${stripPath(path)}/${filePrefix}"
def pathWithPartPrefix(path: String): String = s"${stripPath(path)}/part"
def pathWithPrefix(path: String, prefix: String): String = Option(prefix) match {
Ambivalent about allowing `null` here
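For context, a sketch of what a `null`-tolerant `pathWithPrefix` could look like. The hunk above is truncated after `match {`, so both branches are assumptions: here a `null` prefix falls back to `part`, mirroring the removed `pathWithPartPrefix`. The `stripPath` helper is a standard-library stand-in for `StringUtils.stripEnd`.

```scala
object PathSketch {
  // Stand-in for StringUtils.stripEnd(path, "/"): drops every trailing slash.
  private def stripPath(path: String): String =
    path.reverse.dropWhile(_ == '/').reverse

  // Assumption: a null prefix falls back to "part" (the old pathWithPartPrefix behavior).
  def pathWithPrefix(path: String, prefix: String): String = Option(prefix) match {
    case Some(p) => s"${stripPath(path)}/$p"
    case None    => s"${stripPath(path)}/part"
  }
}
```

Wrapping the argument in `Option` keeps the `null` check in one place, which is one way to accept `null` at the API boundary without propagating it further.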
parts.head == 0 && parts.last + 1 == parts.size // xxxxx part
} else {
} else if (writtenShards.isEmpty) {
// assume progress is complete when shard info is not retrieved and files are present
Is this all we can do here?
It was the previous behavior. When the user gives a custom naming policy or shard template, we do not have this info in the `ReadParameter`, thus we can't find shard numbers easily.
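A rough sketch of the completeness check being discussed, under stated assumptions: shard numbers are parsed from a default-style `-SSSSS-of-NNNNN` file name, and the `isComplete` helper and its regex are hypothetical rather than taken from the PR. When no shard number can be parsed (for example with a custom naming policy), it falls back to "files are present".

```scala
object ShardCheckSketch {
  // Hypothetical pattern for names such as part-00000-of-00002.avro.
  private val ShardPattern = """.*-(\d+)-of-(\d+).*""".r

  def isComplete(fileNames: Seq[String]): Boolean = {
    // Shard indices that could be recovered from the file names, in ascending order.
    val parts = fileNames.collect { case ShardPattern(shard, _) => shard.toInt }.sorted
    if (parts.nonEmpty) {
      // Shards must start at 0 and be contiguous.
      parts.head == 0 && parts.last + 1 == parts.size
    } else {
      // No shard info: assume the write is complete if any file is present.
      fileNames.nonEmpty
    }
  }
}
```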
path = path,
destinationFn = destinationFn,
numShards = numShards,
prefix = prefix,
suffix = suffix,
tempDirectory = tempDirectory
Why provide the parameter names explicitly here? Did you mix up prefix/suffix?
Yes. Since the API is not very strongly typed, using named parameters helped make sure I did not mess up the parameter ordering.
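To illustrate the concern with a hypothetical stand-in (the `FileParams` class below is not part of Scio): when several parameters share the type `String`, a positional call with swapped arguments still compiles, while named arguments make the intent explicit.

```scala
object NamedArgsSketch {
  // Hypothetical holder: all three fields have the same type.
  final case class FileParams(prefix: String, suffix: String, tempDirectory: String)

  // Compiles, but prefix and suffix are silently swapped.
  val positional: FileParams = FileParams(".avro", "part", "/tmp")

  // Named arguments can be given in any order and still bind correctly.
  val named: FileParams = FileParams(suffix = ".avro", prefix = "part", tempDirectory = "/tmp")
}
```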
val collections = Seq(
  "gs://bucket1/data/",
  "gs://bucket2/data/"
).map(path => sc.avroFile[TestRecord](path, suffix = ".avro"))
Is this the way we want to recommend users perform reads going forward? I marginally prefer the old version since it seems more compact & explicit, but that might just be Stockholm syndrome
It depends. I think extension and format are strongly coupled. In terms of job parameters, if the paths are taken from the args, an extension matcher should not be artificially added. If the file format changes, we'd only have to update the job.
If the path is not a suffix matcher (e.g. `part-*`), it is fine to use the path only.
Using path + suffix is also more consistent with the SMB way, where the IO needs to read additional metadata from the path.
21725ef to b501af1
This PR allows all file based IO to define either `prefix` + `shardnameTemplate` or `filenamePolicySupplier` to create files.

To make sure the `tap` returned by the write method captures the written files, taps must be adapted to match on the file `suffix`, the only deterministic part of the filename.

To do so, all file based IO reads are given a `suffix` param, with `null` as default, with the following behavior:
- `path` is treated as a folder and matched with `$path/*$suffix`
- if `path` is a pattern (contains a `*`), an exception is thrown
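
The behavior listed above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `filePattern` name, the exception type, and the handling of a `null` suffix (path used unchanged) are guesses consistent with the discussion, and `stripPath` is a standard-library stand-in for the helper in `ScioUtil`.

```scala
object SuffixSketch {
  // Stand-in for StringUtils.stripEnd(path, "/"): drops every trailing slash.
  private def stripPath(path: String): String =
    path.reverse.dropWhile(_ == '/').reverse

  def filePattern(path: String, suffix: String): String = Option(suffix) match {
    // Assumption: without a suffix the path is used as given (it may already be a pattern).
    case None => path
    // A suffix only makes sense with a folder, so a pattern path is rejected.
    case Some(_) if path.contains("*") =>
      throw new IllegalArgumentException(s"suffix cannot be combined with the pattern path $path")
    // Folder + suffix: match every file in the folder ending with the suffix.
    case Some(s) => s"${stripPath(path)}/*$s"
  }
}
```

With these assumptions, `filePattern("gs://bucket/data/", ".avro")` gives `gs://bucket/data/*.avro`, while a path such as `gs://bucket/data/part-*` is only accepted when no suffix is passed.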