
[BEAM-11322] Apache Beam Example to tokenize sensitive data #13995

Merged (34 commits, Mar 31, 2021)

Conversation

KhaninArtur (Contributor)

Some users may want to protect their sensitive data using tokenization.

We propose a Beam example template that demonstrates a Beam transform for protecting sensitive data using tokenization. In our example, we use an external service for the data tokenization.

At a high level, the pipeline:

  • supports batch (GCS) and streaming (Pub/Sub) input sources
  • tokenizes sensitive data via an external REST service (we plan to use Protegrity)
  • outputs tokenized data into BigQuery or BigTable
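Conceptually, the middle bullet boils down to batching records, serializing them into a request body, and POSTing that to the tokenization service. As a rough plain-Java illustration (the payload shape and field names here are hypothetical, not the actual Protegrity API or the code in this PR), the request body for a batch of records might be assembled like this:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch: building a JSON body for a batch tokenization request. */
public class TokenizationRequestSketch {

    /** Serialize one record's fields as a flat JSON object (no escaping, for brevity). */
    static String toJsonObject(Map<String, String> record) {
        return record.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    /** Wrap a batch of records into a single request payload under a "data" array. */
    static String toRequestBody(List<Map<String, String>> batch) {
        return batch.stream()
                .map(TokenizationRequestSketch::toJsonObject)
                .collect(Collectors.joining(",", "{\"data\":[", "]}"));
    }
}
```

Batching the records before each call (rather than one HTTP request per element) is what makes the external-service hop affordable at pipeline scale.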

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

Post-Commit Tests Status (on master branch)

[Post-commit build-status badge table omitted: Go, Java, Python, and XLang SDK builds across the Dataflow, Flink, Samza, Spark, and Twister2 runners.]

Pre-Commit Tests Status (on master branch)

[Pre-commit build-status badge table omitted: Java, Python, Go, Website, Whitespace, and Typescript jobs, portable and non-portable.]

See .test-infra/jenkins/README for the trigger phrase, status, and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

[Badges omitted: "Build python source distribution and wheels", "Python tests", "Java tests".]

See CI.md for more information about GitHub Actions CI.

Artur Khanin and others added 28 commits January 11, 2021 12:33
…taTokenizationExample

# Conflicts:
#	examples/java/src/main/java/org/apache/beam/examples/complete/datatokenization/DataTokenization.java
#	examples/java/src/main/java/org/apache/beam/examples/complete/datatokenization/transforms/DataProtectors.java
#	examples/java/src/main/java/org/apache/beam/examples/complete/datatokenization/utils/JavascriptTextTransformer.java
…o DataTokenizationExample

# Conflicts:
#	examples/java/src/main/java/org/apache/beam/examples/complete/datatokenization/transforms/DataProtectors.java
# Conflicts:
#	examples/java/build.gradle
@pabloem (Member) commented Feb 25, 2021

hi @KhaninArtur ! thanks for your contribution - is this PR ready to be reviewed?

@KhaninArtur (Contributor Author)

Hi @pabloem! Yes, this PR is ready to be reviewed 👍🏼

@KhaninArtur (Contributor Author)

Hi, @pabloem!
Are there any updates regarding this PR?

@pabloem (Member) left a comment

I apologize for the long delay. I took a quick, superficial look and added some actionable comments. I'm happy to iterate on this.
Thanks!

Comment on lines +83 to +87
```bash
gradle clean execute -DmainClass=org.apache.beam.examples.complete.datatokenization.DataTokenization \
-Dexec.args="--<argument>=<value> --<argument>=<value>"
```

Member:

these instructions are nice and useful. I worry that users will not find out about this example. Do you have plans to blog about it, or add any extra documentation for it?

Contributor Author:

Yes, we plan to spread the word and blog about it.

## Running as a Dataflow Template

This example also exists as a Google Dataflow Template, which you can build and run using Google Cloud Platform. See
this template's documentation [README.md](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/v2/protegrity-data-tokenization/README.md) for
Member:

I don't see anything in this address. Will it be added later?

Member:

I am curious how the template here and in DataflowTemplates will be different?

Contributor Author:

Yes, we have a PR in the DataflowTemplates repository, and it is focused on execution via the Google Cloud Dataflow runner. The Beam version, on the other hand, is a bit more generic.

import org.slf4j.LoggerFactory;

/** The {@link FileSystemIO} class to read/write data from/into File Systems. */
public class FileSystemIO {
Member:

is this class meant to be generic? Or specific for this template? I see the class is within the template package - I am just wondering if we should name the class TokenizationFileIO or something that states clearly that these transforms are meant only to be used for the data tokenization template?

Contributor Author:

Got you, renamed it.

import org.slf4j.LoggerFactory;

/** The {@link BigTableIO} class for writing data from template to BigTable. */
public class BigTableIO {
Member:

Same question as with FileSystemIO

Contributor Author:

Done

import org.slf4j.LoggerFactory;

/** The {@link BigQueryIO} class for writing data from the template to BigQuery. */
public class BigQueryIO {
Member:

Same as FileSystemIO

Contributor Author:

Done

Comment on lines 218 to 241
    @ProcessElement
    public void process(
        ProcessContext context,
        BoundedWindow window,
        @StateId("buffer") BagState<Row> bufferState,
        @StateId("count") ValueState<Integer> countState,
        @TimerId("expiry") Timer expiryTimer) {

      expiryTimer.set(window.maxTimestamp());

      int count = firstNonNull(countState.read(), 0);
      count++;
      countState.write(count);
      bufferState.add(context.element().getValue());

      if (count >= batchSize) {
        processBufferedRows(bufferState.read(), context);
        bufferState.clear();
        countState.clear();
      }
    }

    @SuppressWarnings("argument.type.incompatible")
    private void processBufferedRows(Iterable<Row> rows, WindowedContext context) {
Member:

I see that this DoFn does a lot of its own buffering. Have you considered using GroupIntoBatches[1] for this? GroupIntoBatches has the same sort of buffering/counting/timer emission logic, but it receives more updates (e.g. it recently started supporting 'autosharding' which lets runners decouple the number of shards from the number of keys).

Think about it - but I recommend you try using GroupIntoBatches, as you will get some extra nice benefits from it.

[1] https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/transforms/GroupIntoBatches.html

Contributor Author:

Thank you so much for your suggestion, @pabloem! GroupIntoBatches works really well; I replaced the stateful DoFn with it.
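For context on the swap: per key, GroupIntoBatches buffers elements and emits them in groups of at most batchSize, flushing any remainder when the window expires, which is exactly the count/buffer/timer logic the hand-rolled DoFn above implemented. A plain-Java sketch of that batching semantics (an illustration only, not Beam code; the real transform is Beam's GroupIntoBatches applied to a keyed PCollection):

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of the batching semantics of Beam's GroupIntoBatches. */
public class BatchingSketch {

    /**
     * Split elements into consecutive batches of at most batchSize; the final
     * partial batch models the flush that happens when the window expires.
     */
    static <T> List<List<T>> intoBatches(List<T> elements, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T element : elements) {
            current.add(element);
            if (current.size() >= batchSize) { // mirrors `count >= batchSize` in the old DoFn
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) { // window-expiry flush of the buffered remainder
            batches.add(current);
        }
        return batches;
    }
}
```

Delegating this to the built-in transform also picks up runner-side improvements (such as the autosharding support the reviewer mentions) for free.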

* GroupIntoBatches was used in the data tokenization pipeline

* io files were renamed for the data tokenization template
@KhaninArtur (Contributor Author)

retest this please

1 similar comment
@KhaninArtur (Contributor Author)

retest this please

@pabloem (Member) commented Mar 22, 2021

taking a look today

Nuzhdina-Elena and others added 2 commits March 23, 2021 10:50
…Information about it added to README (#14)

Getting value from environment variables for maxBufferingDurationMs
* Fix bug incorrect DSG url lead to NPE DATAFLOW-139
6 participants