Initial commit for GCS Batch Source plugin metadata feature. #1612

Open
wants to merge 1 commit into
base: release/2.9

@dvitiuk dvitiuk commented Apr 29, 2022

This change extends the functionality of the GCS Batch Source plugin. Two new plugin parameters were added:

  • Length Field
  • Modification Time Field

If a user sets values for these parameters, the plugin output schema is extended with additional fields that carry the configured names. In general, the plugin handles these parameters the same way as the "Path Field" parameter. Additionally, the "GET SCHEMA" functionality was fixed for the case when "Path Field"/"Length Field"/"Modification Time Field" are set.
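To illustrate the behavior described above, here is a minimal, self-contained sketch. It does not use the real CDAP `Schema` or plugin classes: the class `FileMetadataFields`, its methods, and the use of a plain map as a stand-in for a schema are all hypothetical, and the `long` type for the length and modification time fields is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the plugin: a schema is modelled as an
// ordered map of field name -> type name to keep the sketch dependency-free.
final class FileMetadataFields {

  private FileMetadataFields() { }

  // Returns a copy of the base schema with one extra field per configured
  // parameter. A null or empty name means the parameter was not set.
  static Map<String, String> extendSchema(Map<String, String> baseSchema, String pathField,
                                          String lengthField, String modificationTimeField) {
    Map<String, String> schema = new LinkedHashMap<>(baseSchema);
    addIfSet(schema, pathField, "string");
    addIfSet(schema, lengthField, "long");            // assumed type
    addIfSet(schema, modificationTimeField, "long");  // assumed type
    return schema;
  }

  // Returns a copy of the record with the metadata values stored under the
  // configured field names; unset parameters add nothing.
  static Map<String, Object> addMetadata(Map<String, Object> record, String path, long length,
                                         long modificationTime, String pathField,
                                         String lengthField, String modificationTimeField) {
    Map<String, Object> out = new LinkedHashMap<>(record);
    if (isSet(pathField)) {
      out.put(pathField, path);
    }
    if (isSet(lengthField)) {
      out.put(lengthField, length);
    }
    if (isSet(modificationTimeField)) {
      out.put(modificationTimeField, modificationTime);
    }
    return out;
  }

  private static void addIfSet(Map<String, String> schema, String name, String type) {
    if (!isSet(name)) {
      return;
    }
    if (schema.containsKey(name)) {
      throw new IllegalArgumentException("Field '" + name + "' already exists in the schema");
    }
    schema.put(name, type);
  }

  private static boolean isSet(String name) {
    return name != null && !name.isEmpty();
  }
}
```

In this sketch all three parameters go through the same code path, which mirrors the statement that the new parameters are handled like "Path Field".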

This change is supplied in two PRs, the second is .

There are open questions regarding the proposed change:

  1. Four tests in the google-cloud module fail even without my changes:
Failed tests: 
  BigtableSourceConfigTest.testValidateMissingProjectId:95->validateConfigValidationFail:207->validateConfigValidationFail:191 expected:<1> but was:<0>
  BigtableSinkConfigTest.testValidateMissingProjectId:86->validateConfigValidationFail:116 expected:<1> but was:<0>
  DataplexBatchSourceTest.validateServiceAccountWhenIsServiceAccountJsonFalse:107 expected:<0> but was:<2>
  DataplexBatchSourceTest.validateServiceAccountWhenServiceAccountFilePathIsNull:121 expected:<0> but was:<2>

Tests run: 394, Failures: 4, Errors: 0, Skipped: 9

Should they be fixed before this PR is merged, or does it not matter?

  2. The File Batch Source plugin had a "GET SCHEMA" button for all formats in version 2.7.1, but now this button is available only for the "delimited" format. Is this regression intentional? (I ran into this behavior during testing.)

  3. I noticed that the "length" and "modification time" field functionality is in fact implemented for all "AbstractFileSource" batch source plugins. Does it make sense to add these fields to the File Batch Source plugin as well?

  4. Initially, all these changes were intended to create a customized GCS Source plugin that could be added to the current CDAP 6.5.1 cluster. Because some interfaces changed, it was necessary to keep backward compatibility with the unchanged version of File Batch Source and other possible plugins based on AbstractFileSource. Do we still need this for the upcoming 6.7.0? (For example, if we want to use CDAP 6.7.0 with an old plugin that is based on AbstractFileSource and uses the old PathTrackingInputFormat.createRecordReader() without length and modification time support.)
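The backward-compatibility concern in the last question can be sketched as follows. This is only an illustration of one common approach (a new overload that delegates to the old method by default), not the code in this PR. The classes `TrackingFormatSketch`, `LegacyFormat`, and `MetadataAwareFormat` and the `readRecord` signatures are hypothetical and do not match the real `PathTrackingInputFormat.createRecordReader()` signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a path-tracking input format base class.
abstract class TrackingFormatSketch {

  // Old entry point: only the path field is supported.
  abstract Map<String, Object> readRecord(String path, String pathField);

  // New entry point. The default implementation delegates to the old one, so
  // subclasses written against the old interface keep working and simply
  // ignore the length and modification time fields.
  Map<String, Object> readRecord(String path, long length, long modificationTime,
                                 String pathField, String lengthField,
                                 String modificationTimeField) {
    return readRecord(path, pathField);
  }
}

// An "old" plugin format that overrides only the old method.
class LegacyFormat extends TrackingFormatSketch {
  @Override
  Map<String, Object> readRecord(String path, String pathField) {
    Map<String, Object> record = new LinkedHashMap<>();
    if (pathField != null) {
      record.put(pathField, path);
    }
    return record;
  }
}

// An updated format that also fills in the new metadata fields.
class MetadataAwareFormat extends LegacyFormat {
  @Override
  Map<String, Object> readRecord(String path, long length, long modificationTime,
                                 String pathField, String lengthField,
                                 String modificationTimeField) {
    Map<String, Object> record = readRecord(path, pathField);
    if (lengthField != null) {
      record.put(lengthField, length);
    }
    if (modificationTimeField != null) {
      record.put(modificationTimeField, modificationTime);
    }
    return record;
  }
}
```

With this shape, a caller can always invoke the new overload: a legacy format returns a record without the extra fields, while an updated format includes them.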

google-cla bot commented Apr 29, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.
