Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIFI-9259: Adding GeohashRecord Processor #5476

Merged
merged 10 commits into from
Dec 10, 2021
Merged

Conversation

mikayla-yang
Copy link
Contributor

@mikayla-yang mikayla-yang commented Oct 24, 2021

Thank you for submitting a contribution to Apache NiFi.

Please provide a short description of the PR here:

Description of PR

This PR adds a record-based processor that helps NiFi users encode latitude/longitude into Geohashes with desired format and precision, and decode Geohash values into latitude/longitude coordinates.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit? Additional commits in response to PR reviewer feedback should be made on this branch and pushed to allow change tracking. Do not squash or use --force when pushing to allow for clean monitoring of changes.

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
  • Have you written or updated unit tests to verify your changes?
  • Have you verified that the full build is successful on JDK 8?
  • Have you verified that the full build is successful on JDK 11?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
  • If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions CI for build issues and submit an update to your PR as soon as possible.

Copy link
Contributor

@markap14 markap14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing @mikayla-yang! The structure of the project, creating the nar, etc. all look good. Overall I think it's a good approach, but there are some conventions that we need to be sure that we're following, and we need to make sure that we are properly handling failure conditions, etc. Please see inline feedback.

}
}

if (updated) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is written, the destination relationship for ALL records will be either 'success' or 'failure' depending on whether or not the last Record in the FlowFile is successful. We should route the original FlowFile to 'failure' if any Record cannot be enriched, OR we should route each individual Record either to 'success' or 'failure'. Or we should simply enrich the records that we could enrich and skip others. I would lean toward providing a property to allow the user to decide. I can definitely envision a use case where users want to say "I expect all records to be enriched. Otherwise, something is wrong and I want the whole FlowFile to be failed." And I can also imagine other use cases where the user will say "Some of the Records have the latitude/longitude elements and some don't. If the elements are there, enrich them, otherwise skip them." So a property gives the user the freedom to configure it as necessary for their use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a ROUTING_STRATEGY property so that users can specify how to route the flowfile, which I believe covers all the user cases you mentioned:

  • Skip_Unenriched: "Some of the Records have the latitude/longitude elements and some don't. If the elements are there, enrich them, otherwise skip them."

  • ​​Require_All_Enriched: “I expect all records to be enriched. Otherwise, something is wrong and I want the whole FlowFile to be failed."

  • Split_Enriched_Unenriched: split the records that are enriched from those are not. The flowfile including successfully enriched records will be sent to ‘success’, the flowfile including records that are unable to be enriched will be sent to ‘failure’, and the original flowfile will be sent to ‘original’.

Just one quick note on the original code here, the destination relationship for all records will be 'success' as long as one record has been updated, which is identical to the current Skip_Unenriched strategy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikayla-yang I'd like to recommend another alternative which I think would simplify a bit the approach here, yet still aligns with what @markap14 described. I think this can be similar to what was done in GeoEnrichIPRecord, by having a property that indicates whether or not processing should be "all or nothing" or to allow individual record processing. And instead of having the success or fail have a "found", "not found" and "original" relationships.

In the instance where it's all or nothing and we have one record that did not produce a geohash, the entire flow file would land on the "not found" relationship, otherwise it would land in "found". In the case where we'd support individual record processing, it would then ensure to split records into found or not found as needed. And it would always populate the original connection with the original payload no matter what option is selected.

What do you think of this? @markap14 does this align with the scenario you described? @mikayla-yang would this be something you could migrate to if @markap14 agrees it's a fit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @YolandaMDavis for your suggestion! I have posted two commits separately just to provide options. The last commit 97c5a5e is aligned with your comment: I tried to make it parallel to the GeoErichIPRecord as you suggested, and the only difference is our “all or nothing” strategy will route the flowfile to the not found relationship if any of the record is not enriched. The second to last commit c63b7e2 is aligned with the previous design, but I make sure we always route the original flowfiles to the original relationship under every routing strategy.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @mikayla-yang! I noted some minor syntax recommendations, as well as one larger potential refactor for input and output handling.

Based on recommendations from @markap14, it sounds like some adjustments will be necessary to route the input file to an original relationship. Along that line, using a StreamCallback to bring together the InputStream and OutputStream handling seems like it would be helpful.

One other note, as indicated in the Pull Request checklist, new dependencies should mentioned in a NOTICE file under src/main/resources/META-INF in the NAR module. The geohash dependency is licensed under ASL 2.0, which is great, but the inclusion should be noted in this module, and in the nifi-assembly NOTICE.

@mikayla-yang
Copy link
Contributor Author

@markap14 @exceptionfactory Thank you for the comments! I have made changes according to your reviews.

@markap14 For handling the failure conditions, I made the change based on your suggestion and also referred to the logic of GeoEnrichIPRecord. I didn’t refactor the code to use the Map structure as we discussed because for the Require_All_Enriched routing strategy, we will not know which RecordSetWriter to use until we finish processing all records. However, using the Map structure requires us to know the final relationship every time we write a record in order to use a proper RecordSetWriter.

LISCENCE and NOTICE files have been added in the NAR module. The NOTICE under nifi-assembly has been updated as well.

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @mikayla-yang! I recommend adjusting the enum values to use all capital letters following standard Java conventions. I will take an additional look through the updated implementation.

@mikayla-yang
Copy link
Contributor Author

mikayla-yang commented Nov 2, 2021

Test failure fixed and successfully built with-Pcontrib-check

Copy link
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @mikayla-yang. I noted a few more detail recommendations.

@mikayla-yang
Copy link
Contributor Author

Thank you @exceptionfactory for the comments! I have made changes accordingly.

@exceptionfactory
Copy link
Contributor

Thanks for the updates @mikayla-yang! The latest version looks close to completion.

With the recent release of NiFi 1.15.0, can you rebase and update the version references from 1.15.0-SNAPSHOT to 1.16.0-SNAPSHOT? That should correct the issue with the current automated build.

@mikayla-yang
Copy link
Contributor Author

Rebase and fixing the problem of routing unenriched empty flowfiles to success with SPLIT strategy in the upcoming commit...

Comment on lines 1 to 16
nifi-geohash-nar
Copyright 2021-2021 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

******************
Apache Software License v2
******************

The following binary components are provided under the Apache Software License v2

(ASLv2) Geohash Java
The following NOTICE information applies:
Geohash Java
Copyright 2016 Silvio Heuberger and contributors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have any Apache licensed software going into the NOTICE file. We can remove this NOTICE file all together.

Copy link
Contributor

@markap14 markap14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mikayla-yang the code looks much better at this point. I mentioned a couple of very minor points about logging inline. And I think we have an extraneous NOTICE file. Otherwise the code looks pretty good. I'll do some testing of this tomorrow and if all looks good I'll be a +1.

@markap14 markap14 merged commit 24422c4 into apache:main Dec 10, 2021
@markap14
Copy link
Contributor

Thanks for all of your effort here @mikayla-yang ! All looks good to me now. And with a +1 from @exceptionfactory I've merged your changes to main!

krisztina-zsihovszki pushed a commit to krisztina-zsihovszki/nifi that referenced this pull request Jun 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants