diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index ebdd9437b..0ae495f69 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -23,6 +23,7 @@
 * [MongoDB](dataSourcesAndSinks/mongodb.md)
 * [Neo4j](dataSourcesAndSinks/neo4j.md)
 * [Parquet](dataSourcesAndSinks/parquet.md)
+* [BigQuery](dataSourcesAndSinks/bigquery.md)
 * [Running Zingg on Cloud](running/running.md)
 * [Running on AWS](running/aws.md)
 * [Running on Azure](running/azure.md)
diff --git a/docs/dataSourcesAndSinks/bigquery.md b/docs/dataSourcesAndSinks/bigquery.md
new file mode 100644
index 000000000..21b2975d5
--- /dev/null
+++ b/docs/dataSourcesAndSinks/bigquery.md
@@ -0,0 +1,70 @@
## Using Google BigQuery to read and write data with Zingg

Zingg works seamlessly with Google BigQuery. The properties that must be set for BigQuery are described below.

Two driver JARs, namely **spark-bigquery-with-dependencies_2.12-0.24.2.jar** and **gcs-connector-hadoop2-latest.jar**, are required to work with BigQuery. To include these BigQuery drivers in the Spark classpath, set the following environment variables before running Zingg.

```bash
export ZINGG_EXTRA=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_ARGS_EXTRA="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
```

If Zingg is run from outside Google Cloud, it requires authentication. Set the following environment variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```

Connection properties for BigQuery as a data source and as a data sink are given below. If you are curious about how Spark connects to BigQuery, see the [Spark BigQuery connector documentation](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).

### Properties for reading data from BigQuery:

The **"credentialsFile"** property should point to the location of the Google service account key file. This is the same path used to set the **GOOGLE_APPLICATION_CREDENTIALS** variable. The **"table"** property should point to the BigQuery table that contains the source data. The **"viewsEnabled"** property must always be set to true.

```json
 "data" : [{
   "name":"test",
   "format":"bigquery",
   "props": {
     "credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
     "table": "mynotification-46566.zinggdataset.zinggtest",
     "viewsEnabled": true
   }
 }],
```

### Properties for writing data to BigQuery:

To write to BigQuery, a Google Cloud Storage bucket needs to be created and assigned to the **"temporaryGcsBucket"** property.
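If the bucket does not already exist, it can be created up front. A minimal sketch using the Google Cloud SDK, assuming `gsutil` is installed and authenticated; the bucket name is the placeholder used in the example configuration that follows:

```bash
# Create the GCS bucket the BigQuery connector uses for temporary staging.
# "zingg-test" is a placeholder matching the example config below.
gsutil mb gs://zingg-test
```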
+ +```json + "output" : [{ + "name":"output", + "format":"bigquery", + "props": { + "credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json", + "table": "mynotification-46566.zinggdataset.zinggOutput", + "temporaryGcsBucket":"zingg-test", + } + }], +``` + +### Notes: + * The library **"gcs-connector-hadoop2-latest.jar"** can be downloaded from [Google](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar) + and the library **"spark-bigquery-with-dependencies_2.12-0.24.2"** from [maven repo](https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar) + * A typical service account key file look like below. format of the file is json. + + ```json + { + "type": "service_account", + "project_id": "mynotification-46566", + "private_key_id": "905cbfd273ff9205d1cabfe06fa6908e54534", + "private_key": "-----BEGIN PRIVATE KEY-----CERT.....", + "client_id": "11143646541283115487", + "auth_uri": "https://accounts.google.com/o/oauth2/auth", + "token_uri": "https://oauth2.googleapis.com/token", + "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", + "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/zingtest%44mynotification-46566.iam.gserviceaccount.com" +} +``` \ No newline at end of file diff --git a/scripts/zingg.sh b/scripts/zingg.sh index 962a10895..cf1e057dd 100755 --- a/scripts/zingg.sh +++ b/scripts/zingg.sh @@ -12,4 +12,10 @@ else OPTION_JARS="--jars ${ZINGG_EXTRA}" fi -$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE +if [[ -z "${ZINGG_ARGS_EXTRA}" ]]; then + OPTION_ARGS="" +else + OPTION_ARGS="${ZINGG_ARGS_EXTRA}" +fi + +$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS $OPTION_ARGS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE