Documentation for BigQuery connector #236

Merged · 2 commits · May 5, 2022
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -23,6 +23,7 @@
* [MongoDB](dataSourcesAndSinks/mongodb.md)
* [Neo4j](dataSourcesAndSinks/neo4j.md)
* [Parquet](dataSourcesAndSinks/parquet.md)
* [BigQuery](dataSourcesAndSinks/bigquery.md)
* [Running Zingg on Cloud](running/running.md)
* [Running on AWS](running/aws.md)
* [Running on Azure](running/azure.md)
70 changes: 70 additions & 0 deletions docs/dataSourcesAndSinks/bigquery.md
@@ -0,0 +1,70 @@
## Using Google BigQuery to read and write data with Zingg

Zingg works seamlessly with Google BigQuery. The properties that must be set are described below.

Two driver jars, **spark-bigquery-with-dependencies_2.12-0.24.2.jar** and **gcs-connector-hadoop2-latest.jar**, are required to work with BigQuery. To add these drivers to the Spark classpath, set the following environment variables before running Zingg.

```bash
export ZINGG_EXTRA=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar
export ZINGG_ARGS_EXTRA="--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
```
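
With the variables above exported, Zingg can then be launched through its run script as usual. A minimal sketch, where the config file path is a placeholder:

```bash
# Assumes the exports above are in effect and both jars sit in the
# current directory; the config path below is a placeholder
./scripts/zingg.sh --phase findTrainingData --conf examples/bigquery/config.json
```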

If Zingg is run from outside Google Cloud, it requires authentication. Set the following environment variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the [Google Cloud console](https://cloud.google.com/docs/authentication/getting-started).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```
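
The key can also be created from the command line. A sketch using the `gcloud` CLI, where the service account name and project ID are placeholders modeled on the samples below:

```bash
# Create a new JSON key for the service account;
# the account email below is a placeholder
gcloud iam service-accounts keys create /path/to/service-account-key.json \
  --iam-account=zinggtest@mynotification-46566.iam.gserviceaccount.com
```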

Connection properties for BigQuery as a data source and data sink are given below. For more detail on how Spark connects to BigQuery, see the [Spark BigQuery connector documentation](https://github.com/GoogleCloudDataproc/spark-bigquery-connector).

### Properties for reading data from BigQuery:

The **"credentialsFile"** property should point to the location of the Google service account key file; this is the same path used to set the **GOOGLE_APPLICATION_CREDENTIALS** variable. The **"table"** property should point to the BigQuery table that contains the source data. The **"viewsEnabled"** property must be set to true.

```json
"data" : [{
"name":"test",
"format":"bigquery",
"props": {
"credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
"table": "mynotification-46566.zinggdataset.zinggtest",
"viewsEnabled": true
}
}],
```
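
Before running Zingg, it can help to confirm that the key and the table resolve correctly. A minimal sketch using the `gcloud` and `bq` CLIs from the Google Cloud SDK, reusing the sample table above (the key path is a placeholder):

```bash
# Authenticate the SDK with the same service account key used by Zingg
gcloud auth activate-service-account --key-file=/path/to/service-account-key.json

# Show schema and metadata of the source table; note the
# project:dataset.table form used by bq
bq show mynotification-46566:zinggdataset.zinggtest
```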


### Properties for writing data to BigQuery:

To write to BigQuery, a Google Cloud Storage bucket needs to be created and assigned to the **"temporaryGcsBucket"** property; the connector stages the output in this bucket before loading it into BigQuery. A sketch for creating the bucket follows the example below.

```json
"output" : [{
"name":"output",
"format":"bigquery",
"props": {
"credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
"table": "mynotification-46566.zinggdataset.zinggOutput",
"temporaryGcsBucket":"zingg-test",
}
}],
```
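
A minimal sketch for creating the staging bucket with `gsutil`, reusing the bucket name from the sample above (the region is an assumption):

```bash
# Create the temporary staging bucket used by the connector;
# the region is an assumption, pick one close to your BigQuery dataset
gsutil mb -l us-central1 gs://zingg-test
```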

### Notes:
* The library **"gcs-connector-hadoop2-latest.jar"** can be downloaded from [Google](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar), and the library **"spark-bigquery-with-dependencies_2.12-0.24.2.jar"** from the [Maven repository](https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar). A download sketch is given after these notes.
* A typical service account key file looks like the example below; the file format is JSON.

```json
{
    "type": "service_account",
    "project_id": "mynotification-46566",
    "private_key_id": "905cbfd273ff9205d1cabfe06fa6908e54534",
    "private_key": "-----BEGIN PRIVATE KEY-----CERT.....",
    "client_id": "11143646541283115487",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/zingtest%40mynotification-46566.iam.gserviceaccount.com"
}
```
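
A sketch for fetching both jars into the working directory, using the URLs from the note above:

```bash
# Download the GCS connector and the Spark BigQuery connector
# into the current directory, matching the ZINGG_EXTRA paths above
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
wget https://repo1.maven.org/maven2/com/google/cloud/spark/spark-bigquery-with-dependencies_2.12/0.24.2/spark-bigquery-with-dependencies_2.12-0.24.2.jar
```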
8 changes: 7 additions & 1 deletion scripts/zingg.sh
@@ -12,4 +12,10 @@ else
  OPTION_JARS="--jars ${ZINGG_EXTRA}"
fi

$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE
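# Forward any extra spark-submit arguments supplied via ZINGG_ARGS_EXTRA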
if [[ -z "${ZINGG_ARGS_EXTRA}" ]]; then
  OPTION_ARGS=""
else
  OPTION_ARGS="${ZINGG_ARGS_EXTRA}"
fi

$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER $OPTION_JARS $OPTION_ARGS --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.es.nodes="127.0.0.1" --conf spark.es.port="9200" --conf spark.es.resource="cluster/cluster1" --conf spark.default.parallelism="8" --conf spark.executor.extraJavaOptions="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/memLog.txt -XX:+UseCompressedOops" --conf spark.executor.memory=10g --conf spark.debug.maxToStringFields=200 --driver-class-path $ZINGG_JARS --class zingg.client.Client $ZINGG_JARS $@ --email $EMAIL --license $LICENSE