As per documentation there are four ways of accessing Azure Data Lake Storage Gen2:
- Pass your Azure Active Directory credentials, also known as credential passthrough
It requires Azure Databricks Premium Plan which is 2-3 times more expensive
And if you want to have this feature for all users you have to use High concurrency cluster
This scenario is good for per-person clusters and not so good for shared clusters
Example:
configs = {
"fs.azure.account.auth.type": "CustomAccessToken",
"fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
dbutils.fs.mount(
source = "wasbs://cont1@kagstorageaccount1.blob.core.windows.net",
mount_point = "/mnt/cont1",
extra_configs = configs)
dbutils.fs.ls("/mnt/cont1")
-
Mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0
Service principal allows to get rid of usage storage account access keys, and this allows to revoke access for particular Databricks without any changes on Storage Account side
This is the most secure way for shared clusters used for production purposes -
Use a service principal directly
The same as previous one, but mount is more convenient in most cases -
Use the Azure Data Lake Storage Gen2 storage account access key directly
Quick and dirty one, so good for test clusters without production data
If you want to revoke cluster access to storage you'll have to make a change on storage level and update other clusters
Examples:
configs = {
"fs.azure.account.key.kagsa2.blob.core.windows.net": "lRm***RBQ=="
}
dbutils.fs.mount(
source = "wasbs://cont1@kagstorageaccount1.blob.core.windows.net",
mount_point = "/mnt/cont1",
extra_configs = configs)
dbutils.fs.ls("/mnt/cont1")
..and:
configs = {
"fs.azure.account.key.kagstorageaccount1.blob.core.windows.net":dbutils.secrets.get(scope = "db-ws-cl1-secscope1", key = "kagstorageaccount1key")
}
dbutils.fs.mount(
source = "wasbs://cont1@kagstorageaccount1.blob.core.windows.net",
mount_point = "/mnt/cont1",
extra_configs = configs)
dbutils.fs.ls("/mnt/cont1")
Mount an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal and OAuth 2.0 in depth
- Create Service principal (Azure Portal > Azure Active Directory > App registrations > New registration > type name, e.g.
db-ws-sa1
> keep all other ) - Type name, e.g.
ws-sa1
and keep all other params as per default - Note "Application ID" and "Tenant ID"
- Create Secret and note it
- Don't assign any roles!
- For this demo purpose I'll create Resource group
db-ws-rg1
- Create Azure KeyVault in Resource group
- Enter name (e.g.
db-ws-kv1
), resource group to place in and keep all other settings default - Create secret named
db-ws-sa1-sec
and enterdb-ws-sa1
's Secret - Set expiration date to date when
db-ws-sa1
's Secret expires
- Create Storage account with name
dbwssa1
in Resource group - Use "Standard" performance, "StorageV2" type, "RA-GRS" replication, "Hot" tier and "Enabled" Hierarchical namespace
- Create container e.g. with name
cont1
- In Storage account IAM assign "Storage Blob Data Contributor" role to Service principal
ATTENTION: IAM role can't be assigned on Resource group or Subscription level, only on Storage account
ATTENTION (again): you can assign only "Storage Blob Data *" roles, no other stuff will work:
StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.
ErrorCode=AuthorizationPermissionMismatch
ErrorMessage=This request is not authorized to perform this operation using this permission.
If you enter wrong creds you'll get different error:
Body: {"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.\r\nTrace ID: 3b3c6e2a-45e0-450c-adc0-c01579341e00\r\nCorrelation ID: 5ad5d96e-79a6-4a53-acfb-776ede8ea885\r\nTimestamp: 2020-04-09 17:28:46Z","error_codes":[7000215],"timestamp":"2020-04-09 17:28:46Z","trace_id":"3b3c6e2a-45e0-450c-adc0-c01579341e00","correlation_id":"5ad5d96e-79a6-4a53-acfb-776ede8ea885","error_uri":"https://login.microsoftonline.com/error?code=7000215"}
- Run the following Scala code:
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
- Create Databricks workspace in Resource group
- Enter name (e.g.
db-ws-cl1
), region and "Standard" for Tier - Create new cluster of "Standard" mode and other params as default
- Azure Key Vault-backed secret scope is more reliable and advanced than Databricks-backed scope
- Go to
https://<your_azure_databricks_url>#secrets/createScope
(more info here) - Enter Scope name, e.g.
db-ws-cl1-secscope1
- Select "All users" to manage Principal. If you want "Creator" to be the only who can manage it you have to use Premium tier. This is critical for Databricks-backed scope but not for Azure Key Vault-backed scope
- You cal check current list of scopes by running
databricks secrets list-scopes
- Create Python notebook:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "c76b0be4-bbf8-42f7-8541-37f8df1f8914", # Service principal's Application ID
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "db-ws-cl1-secscope1", key = "db-ws-sa1-sec"), # Azure Key Vault-backed secret scope's name and Vault's secret name
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/a9f9369c-7127-4e16-b301-8de8e28b309c/oauth2/token", # Service principal's Tenant ID
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://cont1@kagstorageaccount1.dfs.core.windows.net",
mount_point = "/mnt/cont1",
extra_configs = configs)
dbutils.fs.ls("/mnt/cont1")
- Unmount:
dbutils.fs.unmount("/mnt/cont1")