Skip to content
This repository has been archived by the owner on Nov 11, 2022. It is now read-only.

KafkaIO should return one split for each of partition. #491

Merged
merged 6 commits into from
Dec 10, 2016

Conversation

rangadi
Copy link
Contributor

@rangadi rangadi commented Dec 1, 2016

KafkaIO should return one split for each of partition.

This is the actual unit of parallelism for Kafka topic. desiredNumSplits that Dataflow passes to a custom source is very low when maxNumWorkers is set. It asks for
just one split for each of the workers. This limits use of CPU cores on the workers essentially making autoscaling use more resources without improving performance.

This includes a hack to force single split in many unit tests since DirectPipelineRunner and InProcessPipelineRunner don't seem to read from more than one split.

This is the actual unit of parallelism for Kafka topic.
desiredNumSplits that Dataflow passes to a custom source
is not very low when maxNumWorkers is set, it asks for
just one split for each of the workers. This limits
use of CPU cores on the workers essentially making
autoscaling use more resources without improving peformance.

This includes a hack to force single split in many unit tests
since DirectPipelineRunner and InProcessPipelineRunner don't
seem to read from more than one split.
@rangadi
Copy link
Contributor Author

rangadi commented Dec 1, 2016

+R: @davorbonaci, @dhalperi, @tgroh

DirectPipelineRunner does not call getInitialSplits().
Rather than forcing single split through a special config, force it
when it invoked from within KafkaIO itself.
@rangadi
Copy link
Contributor Author

rangadi commented Dec 1, 2016

Update based on Thomas comment on chat.
DirectPipelineRunner does not call getInitialSplits().
Rather than forcing single split through a special config, force it
when it invoked from within KafkaIO itself.

}
for (int i = 0; i < partitions.size(); i++) {
assignments.get(i % numSplits).add(partitions.get(i));
if (desiredNumSplits < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be inlined? More specifically, you could factor out the partitions preprocessing, and then just call the constructor in generateInitialSplits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure exactly what you meant.. I minimized the diff by reusing old code that handles the generic case where number of partitions and splits might not match. PTAL.

@rangadi
Copy link
Contributor Author

rangadi commented Dec 8, 2016

I was not sure exactly what you meant.. I minimized the diff by reusing old code that handles the generic case where number of partitions and splits might not match. PTAL.

@rangadi
Copy link
Contributor Author

rangadi commented Dec 9, 2016

Updated after a clarification from Thomas. It makes sense. There is no special case for single split in generateInitialSplits(). createReader() creates single reader if there aren't any partitions assigned (as happens with direct runner).

Updated couple of javadoc comments as well.

@tgroh
Copy link
Contributor

tgroh commented Dec 9, 2016

[ERROR] src/test/java/com/google/cloud/dataflow/contrib/kafka/KafkaIOTest.java:[29,8] (imports) UnusedImports: Unused import: com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory.

Otherwise LGTM

@rangadi
Copy link
Contributor Author

rangadi commented Dec 9, 2016

Thanks. Just pushed the fix for unused import. I will ping once travis-ci is happy.

@rangadi
Copy link
Contributor Author

rangadi commented Dec 9, 2016

@tgroh all the checks passed. Thanks for the review.

@tgroh tgroh merged commit 4e8f101 into GoogleCloudPlatform:master Dec 10, 2016
rangadi pushed a commit to rangadi/DataflowJavaSDK that referenced this pull request Dec 14, 2016
Recent PR GoogleCloudPlatform#491 changes how KafkaIO splits. This makes it incompatible
with Dataflow update across these two versions.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants