Accessing Blob Storage Data from Databricks

Access a Blob Storage Account from the Databricks Environment

In this exercise I create a Databricks workspace in the Azure portal and connect it to a blob storage account in my Azure subscription. To authenticate against the storage account, I create a key vault in the Azure portal and a secret scope in Databricks, then use the stored keys to set up authentication for the pipeline.

Create a Databricks Workspace

Log in to the Azure portal ➛ create a new resource ➛ search for Databricks ➛ create.

On the configuration page, choose an existing subscription or Pay-As-You-Go. If you already have a resource group, choose it from the dropdown list; otherwise create a new one and give it a name (e.g. dbproject-rg). Give your workspace a name (e.g. dbproject-ws). For the location, choose the region closest to you; for the pricing tier, choose Trial (Premium - 14-Days Free DBUs); then click Review + Create.

It will take a few minutes for the workspace to deploy. Once the deployment is complete, go to your resource group, where you can change settings as needed, view the overview, and launch the workspace.

Launching the workspace opens a new browser window that asks you to authenticate with your Azure account. After signing in, the Databricks environment opens.

To begin working, create a cluster  by clicking on Clusters in the left side pane.

Create a Cluster

In the new window, click Create New Cluster and set the configuration:

  • Cluster Name: give the cluster a name 
  • Cluster Mode: set the cluster mode to Standard
    • High Concurrency: optimized to run concurrent SQL, Python, and R workloads, and does not support Scala.
    • Standard: is recommended for single-user clusters and can run SQL, Python, R, and Scala workloads. 
  • Pool:  keep a defined number of ready instances on standby to reduce cluster startup time.  Leave it as None.
  • Databricks Runtime Version: select the image that will be used to create the cluster. Set the runtime to Runtime 6.0 (Scala 2.11, Spark 2.4.3), which supports Python 3.
  • Autopilot Options:  creates a cluster that automatically scales between the minimum and maximum number of nodes, based on load.  A cluster is considered inactive when all commands on the cluster have finished executing.  Enable autoscaling and set it to terminate after 30 minutes of inactivity to save cost.
  • Worker Type: The worker types have accelerated data access through Delta caching. The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed. Set the Min Workers to 2 and Max Workers to 4.
  • Driver Type: leave the driver type set to Same as worker, then hit the Create button.

Once the cluster is created, it can be found under interactive clusters and will have a green dot next to its name. 
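If you prefer to script the setup, roughly the same cluster configuration can be submitted through the Databricks Clusters REST API. Below is a minimal sketch using Python's requests library; the workspace URL, personal access token, cluster name, and node type are placeholders rather than values from this exercise:

import requests

# Placeholders: substitute your own workspace URL and a personal access token.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "dbproject-cluster",                 # any name you like
    "spark_version": "6.0.x-scala2.11",                  # Runtime 6.0 (Scala 2.11, Spark 2.4.3)
    "node_type_id": "Standard_DS3_v2",                   # example worker/driver VM size
    "autoscale": {"min_workers": 2, "max_workers": 4},   # autoscaling range from the UI steps
    "autotermination_minutes": 30,                       # terminate after 30 minutes of inactivity
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
response.raise_for_status()
print(response.json())   # the response contains the new cluster_id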

Create a Notebook

The next step is to create a notebook. To create a new notebook, click Workspace in the left-side menu ➛ Users ➛ your user account ➛ Create ➛ Notebook. Give the notebook a name, select Python as the language, pick the cluster you created earlier, and hit the Create button.
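Once the notebook opens and is attached to the running cluster, a quick sanity check in the first cell confirms everything is wired up (the spark session and dbutils helper are injected automatically by Databricks):

print(spark.version)        # should print 2.4.3 for Runtime 6.0
display(spark.range(5))     # tiny DataFrame to confirm the cluster executes jobs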

Create a Secret Key

To access the blob storage from the Databricks environment, we need a secret key and a secret scope.

To create the secret key, go to the Azure portal ➛ add a new resource ➛ search for Key Vault ➛ create. Once the key vault is created, open it, choose Secrets from the left-side menu, and click to generate a new secret.

In the configuration, leave the upload option set to Manual, and for the Name and Value use your storage account's name and access key, respectively.
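If you would rather script this step, the same secret can be created with the Azure SDK for Python (the azure-identity and azure-keyvault-secrets packages); the vault URL and access key below are placeholders for your own values:

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault URL -- use your own key vault's DNS name.
vault_url = "https://<your-key-vault>.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

# Store the storage account access key under the storage account's name
client.set_secret("dbprojectstorageaccount", "<storage-account-access-key>")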

Secret Scope

The secret scope must be created in the Databricks environment. In the browser tab where Databricks is open, go to the address bar and append secrets/createScope after the # in the URL.

This opens the secret scope's configuration page. Set a scope name and choose All Users for the Manage Principal. Copy the DNS name and resource ID from your Azure key vault, paste them in, and hit the Create button.
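Back in the notebook, you can verify that the Key Vault-backed scope is visible to Databricks. The scope name dbpSecretScope and secret name dbprojectstorageaccount used here are the ones referenced in the mount command below:

print(dbutils.secrets.listScopes())             # the new scope should appear in this list
print(dbutils.secrets.list("dbpSecretScope"))   # lists secret names, not their values
# Secret values are redacted in notebook output, so this prints [REDACTED]
print(dbutils.secrets.get(scope="dbpSecretScope", key="dbprojectstorageaccount"))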

Mount Storage Account

After setting up the access keys, mount the storage account in your Databricks environment using the secret scope name and secret key created earlier.

Go back to your notebook and use the following  Python script to mount the container: 

# Mount the "dbpcontainer" container of the "dbprojectstorageaccount" storage account,
# pulling the access key from the "dbpSecretScope" secret scope created above.
dbutils.fs.mount(
    source="wasbs://dbpcontainer@dbprojectstorageaccount.blob.core.windows.net",
    mount_point="/mnt/",
    extra_configs={
        "fs.azure.account.key.dbprojectstorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="dbpSecretScope", key="dbprojectstorageaccount")
    }
)

Use display(dbutils.fs.ls("/mnt/")) to list the contents of the mounted file system, and use the path to import the dataset as a DataFrame.
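For example, the following cell checks the mount and reads a file into a DataFrame; the file name sample_data.csv is only a placeholder for whatever sits in your container:

# Check that the container shows up among the workspace mounts
for m in dbutils.fs.mounts():
    print(m.mountPoint, m.source)

# Browse the mounted container
display(dbutils.fs.ls("/mnt/"))

# Load a file from the mounted container into a Spark DataFrame
df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/sample_data.csv"))
display(df)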

Explore the Data

Using SQL, create a temporary view and query the tables within your dataset.
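A minimal sketch of that step, continuing from the DataFrame loaded above (the view name blob_data is illustrative):

df.createOrReplaceTempView("blob_data")   # register the DataFrame as a temporary view

# Query the view with Spark SQL; the same queries can be run in a %sql cell.
display(spark.sql("SELECT COUNT(*) AS row_count FROM blob_data"))
display(spark.sql("SHOW TABLES"))          # the temporary view appears alongside any catalog tables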

You are now connected to your storage account and can explore your dataset, perform analyses, and build visualizations.
