This post walks through setting up a development workflow and CI/CD (Continuous Integration/Continuous Deployment) using Databricks notebooks and the Repos feature.
Overview
This repository contains the notebooks and instructions for setting up a demo of the development workflow and CI/CD (on GitHub Actions) using Databricks notebooks and the Repos feature.
The template repo is designed with enterprises in mind that use an Azure Service Principal to authenticate to Databricks and GitHub. Before you decide to use this template, please review this post and learn about the other CI/CD options available for Databricks, along with their pros and cons: https://siva.blog/post/databricks-cicd-methods/
Get Started
Clone or fork this repository to your GitHub account. You can also use this repository as a template for your own repository.

```shell
git clone https://github.com/sivadotblog/databricks-template-repo
```
If you’d like to contribute to this repository, please create a fork and submit a pull request.
If you find a bug or have a feature request, please create an issue.
Development workflow
The development workflow is organized as follows:
- The developer works on the code in the Databricks workspace within their personal repo. When code changes are done, they are committed to their own feature branch.
- When the feature is ready, the developer creates a pull request to merge the changes into the non-production (dev, test, uat) branch of the personal repo.
- The CI/CD implementation (GitHub Actions here) picks up the changes and updates the non-production (dev, test, uat) branch in Databricks Repos.
- When the pull request is merged to main, a release tag is created and the CI/CD pipeline updates the production branch in Databricks Repos.
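The branch-driven flow above could be wired up with a workflow trigger along these lines. This is a minimal, hypothetical sketch: the workflow name, the `deploy.py` script, and the environment name are illustrative and not part of the template repo.

```yaml
# Hypothetical GitHub Actions workflow sketch; the deploy script and
# names are illustrative, not taken from the template repo.
name: deploy-to-databricks

on:
  push:
    branches: [dev, test, uat]   # non-production branches
  release:
    types: [created]             # production deploys on release tags

jobs:
  update-databricks-repo:
    runs-on: ubuntu-latest
    environment: dev             # pick the environment matching the branch
    steps:
      - uses: actions/checkout@v3
      - name: Update the Databricks repo to this branch
        run: python deploy.py --branch "${GITHUB_REF_NAME}"
        env:
          DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
          AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
```

The `environment:` key ties the job to the per-environment secrets described in the setup sections below.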
Developer’s workflow
- Set up your Git integration under User Settings.
- Clone the repo to your personal space in the Databricks workspace and start working on the code.
Setup on Databricks side
Your Databricks workspace needs to have the Repos functionality enabled. If it is enabled, you should see the "Repos" icon in the navigation panel.
- Your Databricks admin needs to create top-level Repos directories for your team. For example, `/Repos/Dev` for development, `/Repos/Test` for testing, and `/Repos/Prod` for production. The directories should be created with "Can Manage" permission for the team.
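Once a directory such as `/Repos/Dev` exists, the pipeline can create a repo inside it through the Databricks Repos API (`POST /api/2.0/repos`). A minimal sketch of the request body, using this template repo as the example; the helper function name is ours, not part of the template:

```python
# Sketch of the request body for POST /api/2.0/repos, which creates a
# Databricks repo under an existing /Repos/... directory.
import json

def build_repo_create_payload(git_url: str, provider: str, path: str) -> dict:
    """Build the JSON body for the Repos API create call."""
    return {"url": git_url, "provider": provider, "path": path}

payload = build_repo_create_payload(
    "https://github.com/sivadotblog/databricks-template-repo",
    "gitHub",
    "/Repos/Dev/databricks-template-repo",
)
print(json.dumps(payload))
```

The body would be POSTed to `https://<DATABRICKS_HOST>/api/2.0/repos` with a bearer token in the `Authorization` header.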
Setup on Azure side
You will need to create an Azure service principal and grant it the necessary access to the respective Databricks workspace.
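At run time, the pipeline exchanges the service principal's credentials for an Azure AD token that Databricks accepts, via the client-credentials flow. A stdlib-only sketch of that token request; the tenant/client values are placeholders, and the GUID is the well-known Azure AD application ID of Azure Databricks:

```python
# Sketch of the Azure AD client-credentials token request a CI/CD job
# could make on behalf of the service principal. Placeholder credentials.
from urllib.parse import urlencode

# Well-known Azure AD application ID of the Azure Databricks resource.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Return (url, form-encoded body) for the v2.0 client-credentials flow."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{DATABRICKS_RESOURCE_ID}/.default",
    })
    return url, body

url, body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
print(url)
```

The returned access token is then sent to the Databricks REST API as `Authorization: Bearer <token>`.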
Setup on Github side
Create environment variables
Ideally, you should create multiple environments to store secrets and variables; for example, dev, stage, and prod environments.
- `DATABRICKS_HOST` – the URL of your workspace where tests will be executed (host name with `https://`, without `?o=`, and without a trailing slash character. For example: `https://adb-4523452345452435.9.azuredatabricks.net`).
- `REPO_DIRECTORY` – the directory for the staging checkout that we created above. For example, `/Repos/Dev/databricks-template-repo`.
- `AZURE_TENANT_ID` – the ID of the Azure tenant where the service principal is created.
- `AZURE_CLIENT_ID` – the ID of the service principal that will be used to authenticate to Azure.
- `AZURE_CLIENT_SECRET` – the secret of the service principal that will be used to authenticate to Azure.
- `GH_TOKEN` – the token that will be used to authenticate to GitHub. This token should have the "repo" scope and should be able to create tags in the repository.
- `GH_USERNAME` – the username of the user that will be used to authenticate to GitHub. This user should have "admin" access to the repository.
- `GIT_PROVIDER` – the name of the Git provider.
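A pipeline script could read these variables from the environment and fail fast when one is missing. A small illustrative sketch (the function is ours; only the variable names come from the list above):

```python
# Sketch: load the CI/CD variables listed above and fail fast on gaps.
import os

REQUIRED = [
    "DATABRICKS_HOST", "REPO_DIRECTORY",
    "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET",
    "GH_TOKEN", "GH_USERNAME", "GIT_PROVIDER",
]

def load_cicd_config(environ=os.environ) -> dict:
    """Return the required variables as a dict, or raise listing what's missing."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}
```

Failing at startup with the full list of missing names is friendlier than a cryptic 401 halfway through a deploy.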
FAQ & Troubleshooting
What does the automated CI/CD pipeline do?
The automation does several things:
- It checks whether the repo already exists in the workspace. If not, it creates it.
- It updates the repo in the workspace to the latest version of the specified branch from the GitHub repo.
- It also creates a new release tag in the repo and updates the production branch to the latest version.
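The check-create-update part of that flow can be sketched as a small function. The `client` here is a hypothetical wrapper over the Databricks Repos API (`GET /api/2.0/repos`, `POST /api/2.0/repos`, `PATCH /api/2.0/repos/{id}`); the fake client only demonstrates the control flow, not the HTTP calls:

```python
# Sketch of the pipeline's repo-sync logic. `client` is a hypothetical
# wrapper over the Databricks Repos API; only the control flow is shown.

def sync_repo(client, git_url, provider, path, branch):
    """Create the repo at `path` if absent, then check out `branch`."""
    existing = client.get_repo_by_path(path)                    # GET
    if existing is None:
        existing = client.create_repo(git_url, provider, path)  # POST
    client.update_repo(existing["id"], branch=branch)           # PATCH
    return existing["id"]

class FakeClient:
    """Stand-in client used to demonstrate the flow without a workspace."""
    def __init__(self):
        self.repos, self.branches = {}, {}
    def get_repo_by_path(self, path):
        return self.repos.get(path)
    def create_repo(self, git_url, provider, path):
        repo = {"id": len(self.repos) + 1, "url": git_url, "path": path}
        self.repos[path] = repo
        return repo
    def update_repo(self, repo_id, branch):
        self.branches[repo_id] = branch

client = FakeClient()
repo_id = sync_repo(
    client, "https://github.com/sivadotblog/databricks-template-repo",
    "gitHub", "/Repos/Dev/databricks-template-repo", "dev",
)
print(repo_id, client.branches[repo_id])   # 1 dev
```

Running `sync_repo` a second time against the same path reuses the existing repo and only moves the branch, which is exactly the idempotent behavior you want from a redeploy.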
I’m getting "Error fetching repo ID for … Unauthorized access to Org…"
This usually happens when you are running the CI/CD pipeline against a Databricks workspace that has IP Access Lists enabled and the CI/CD server's IP is not in the allow list.