Access A Blob Storage Account from Databricks Environment
In this exercise, I create a Databricks workspace in the Azure portal and connect it to a blob storage account in my Azure subscription. To connect to the storage account, I create a key vault in the Azure portal and a secret scope in Databricks, and then use the stored keys to set up authentication for the pipeline.
Create a Databricks Workspace
Log in to the Azure portal ➛ create a new resource ➛ search for Databricks ➛ create.
On the configuration page, for subscription choose the subscription you already have or choose Pay-As-You-Go. If you already have a resource group, choose it from the dropdown list; otherwise create a new one and give it a name (e.g. dbproject-rg). Give your workspace a name (e.g. dbproject-ws). For the location, choose the region closest to you, for the pricing tier choose Trial (Premium - 14-Days Free DBUs), and click Review + Create.
It will take a few minutes for the workspace to deploy. Once the deployment is complete, go to your resource group, where you can change settings as needed, access the overview, and launch the workspace.
Launching the workspace opens a new browser window that asks for Azure account authentication. The Databricks environment opens after signing in.
To begin working, create a cluster by clicking on Clusters in the left side pane.
Create a Cluster
In the new window, click on Create Cluster and set the configuration:
Cluster Name: give the cluster a name
Cluster Mode: set the cluster mode to Standard
High Concurrency: optimized to run concurrent SQL, Python, and R workloads, and does not support Scala.
Standard: is recommended for single-user clusters and can run SQL, Python, R, and Scala workloads.
Pool: keep a defined number of ready instances on standby to reduce cluster startup time. Leave it as None.
Databricks Runtime Version: select the image that will be used to create the cluster. Set the runtime to Runtime 6.0 (Scala 2.11, Spark 2.4.3), which supports Python 3.
Autopilot Options: autoscaling creates a cluster that automatically scales between the minimum and maximum number of nodes based on load, and a cluster is considered inactive when all commands on it have finished executing. Enable autoscaling and set the cluster to terminate after 30 minutes of inactivity to save cost.
Worker Type: The worker types have accelerated data access through Delta caching. The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. Set the Min Workers to 2 and Max Workers to 4.
Driver Type: leave the driver type as Same as worker and hit the create button.
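As an alternative to the UI, the same settings can be submitted to the Databricks Clusters REST API. The sketch below is only illustrative: the workspace URL, personal access token, cluster name, and node type are placeholders you would replace with your own values.

```python
import requests

# Placeholder workspace URL and personal access token -- replace with your own.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<your-personal-access-token>"

# Cluster settings mirroring the UI choices above: Runtime 6.0,
# autoscaling between 2 and 4 workers, 30-minute auto-termination.
cluster_config = {
    "cluster_name": "dbproject-cluster",
    "spark_version": "6.0.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",  # example worker type; pick one available in your region
    "autoscale": {"min_workers": 2, "max_workers": 4},
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_config,
)
print(response.json())  # returns the new cluster_id on success
```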
Once the cluster is created, it can be found under interactive clusters and will have a green dot next to its name.
Create a Notebook
The next step is to create a notebook. To create a new notebook, click on Workspace in the left-side menu ➛ Users ➛ choose your user account ➛ Create ➛ Notebook. Give the notebook a name, select Python as the language, pick the cluster you created earlier, and hit the create button.
Create a Secret Key
To access the blob storage from the Databricks environment, we need a secret key and a secret scope.
To create the secret key, go to the Azure portal ➛ add a new resource ➛ search for key vault ➛ click create. Once the key vault is created, open it, choose Secrets from the left-side menu, and click to generate a new secret.
For configuration, leave the upload option set to Manual, and for the Name and Value use your storage account's name and access key.
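The same secret can also be stored programmatically with the azure-keyvault-secrets Python SDK instead of the portal. This is a minimal sketch; the vault URL, secret name, and key value are placeholders for your own setup.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder key vault URL -- replace with your vault's DNS name.
vault_url = "https://dbproject-kv.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Store the storage account access key under the storage account's name.
client.set_secret("<your-storage-account-name>", "<your-storage-account-access-key>")
```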
Secret Scope
The secret scope must be created in the Databricks environment. In the browser tab where Databricks is open, go to the address bar and append secrets/createScope after the # in the workspace URL.
This will open the secret scope's configuration page. Set a scope name and choose All Users for Manage Principal. Copy the DNS name and resource ID from your Azure key vault, paste them in, and hit the create button.
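To confirm the scope was created and is backed by your key vault, you can list scopes and secrets from a notebook cell. The scope name below is a placeholder for whatever you entered on the createScope page.

```python
# Run in a notebook cell on the cluster created earlier.
print(dbutils.secrets.listScopes())             # should include your new scope
print(dbutils.secrets.list("dbproject-scope"))  # lists secret keys without revealing their values
```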
Mount Storage Account
After setting up the access keys, mount the storage account in your Databricks environment using the secret scope name and the secret key created earlier.
Go back to your notebook and use the following Python script to mount the container:
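The exact script depends on your own names, so the version below is a sketch: the storage account, container, scope, and secret key names are placeholders to substitute with your own.

```python
# Placeholder names -- replace with your storage account, container,
# secret scope, and secret key names.
storage_account = "dbprojectstorage"
container = "data"
scope = "dbproject-scope"
secret_key = "<your-storage-account-name>"  # the secret created in the key vault

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope=scope, key=secret_key)
    },
)

# Confirm the mount by listing the container contents.
display(dbutils.fs.ls(f"/mnt/{container}"))
```

Once the container is mounted, its files can be read like any other path under /mnt, for example with spark.read.csv(f"/mnt/{container}/yourfile.csv").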