This post walks through setting up a development workflow and CI/CD (Continuous Integration/Continuous Deployment) using Databricks notebooks and the Repos feature.
Overview
This repository contains the notebooks and instructions for setting up a demo of the development workflow and CI/CD (on GitHub Actions) using Databricks notebooks and the Repos feature.
The template repo is designed with enterprises in mind that use an Azure Service Principal to authenticate to Databricks and GitHub. Before you decide to use this template, please review this post and learn about the other CI/CD options available for Databricks, along with their pros and cons: https://siva.blog/post/databricks-cicd-methods/
Get Started
Clone or fork this repository to your GitHub account. You can also use this repository as a template for your own repository.

```shell
git clone https://github.com/sivadotblog/databricks-template-repo
```
If you’d like to contribute to this repository, please create a fork and submit a pull request.
If you find a bug or have a feature request, please create an issue.
Development workflow
The development workflow is organized as follows:
- The developer works on the code in the Databricks workspace within their personal repo. When code changes are done, they are committed to their own feature branch.
- When the feature is ready, the developer creates a pull request to merge the changes into the non-production (dev, test, uat) branch of the personal repo.
- The CI/CD implementation (GitHub Actions here) picks up the changes and updates the non-production (dev, test, uat) branch in Databricks Repos.
- When the pull request is merged to main, a release tag is created and the CI/CD pipeline updates the production branch in Databricks Repos.
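The branch-driven flow above could be wired up with a workflow trigger along these lines. This is a minimal, hypothetical sketch: the workflow name, the `deploy.py` script, and the environment name are illustrative and not part of the template repo.

```yaml
# Hypothetical GitHub Actions workflow sketch; the deploy script and
# names are illustrative, not taken from the template repo.
name: deploy-to-databricks

on:
  push:
    branches: [dev, test, uat]   # non-production branches
  release:
    types: [created]             # production deploys on release tags

jobs:
  update-databricks-repo:
    runs-on: ubuntu-latest
    environment: dev             # pick the environment matching the branch
    steps:
      - uses: actions/checkout@v3
      - name: Update the Databricks repo to this branch
        run: python deploy.py --branch "${GITHUB_REF_NAME}"
        env:
          DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
          AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
          AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
```

The `environment:` key ties the job to the per-environment secrets described in the setup sections below.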
Developer’s workflow
- Set up your Git integration under User Settings.
- Clone the repo to your personal space in the Databricks workspace and start working on the code.
Setup on Databricks side
Your Databricks workspace needs to have the Repos functionality enabled. If it is enabled, you should see the "Repos" icon in the navigation panel.
- Your Databricks admin needs to create top-level Repos directories for your team. For example, `/Repos/Dev` for development, `/Repos/Test` for testing, and `/Repos/Prod` for production. The directories should be created with "Can Manage" permission for the team.
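Once a directory such as `/Repos/Dev` exists, the pipeline can create a repo inside it through the Databricks Repos API (`POST /api/2.0/repos`). A minimal sketch of the request body, using this template repo as the example; the helper function name is ours, not part of the template:

```python
# Sketch of the request body for POST /api/2.0/repos, which creates a
# Databricks repo under an existing /Repos/... directory.
import json

def build_repo_create_payload(git_url: str, provider: str, path: str) -> dict:
    """Build the JSON body for the Repos API create call."""
    return {"url": git_url, "provider": provider, "path": path}

payload = build_repo_create_payload(
    "https://github.com/sivadotblog/databricks-template-repo",
    "gitHub",
    "/Repos/Dev/databricks-template-repo",
)
print(json.dumps(payload))
```

The body would be POSTed to `https://<DATABRICKS_HOST>/api/2.0/repos` with a bearer token in the `Authorization` header.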
Setup on Azure side
You will need to create an Azure service principal and grant it the necessary access to the respective Databricks workspace.
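At run time, the pipeline exchanges the service principal's credentials for an Azure AD token that Databricks accepts, via the client-credentials flow. A stdlib-only sketch of that token request; the tenant/client values are placeholders, and the GUID is the well-known Azure AD application ID of Azure Databricks:

```python
# Sketch of the Azure AD client-credentials token request a CI/CD job
# could make on behalf of the service principal. Placeholder credentials.
from urllib.parse import urlencode

# Well-known Azure AD application ID of the Azure Databricks resource.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Return (url, form-encoded body) for the v2.0 client-credentials flow."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": f"{DATABRICKS_RESOURCE_ID}/.default",
    })
    return url, body

url, body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
print(url)
```

The returned access token is then sent to the Databricks REST API as `Authorization: Bearer <token>`.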
Setup on Github side
Create environment variables
Ideally, you should create multiple environments to store secrets and variables; for example, dev, stage, and prod environments.
- `DATABRICKS_HOST` – the URL of your workspace where tests will be executed (host name with `https://`, without `?o=`, and without a trailing slash character. For example: `https://adb-4523452345452435.9.azuredatabricks.net`).
- `REPO_DIRECTORY` – the directory for the staging checkout that we created above. For example, `/Repos/Dev/databricks-template-repo`.
- `AZURE_TENANT_ID` – the ID of the Azure tenant where the service principal is created.
- `AZURE_CLIENT_ID` – the ID of the service principal that will be used to authenticate to Azure.
- `AZURE_CLIENT_SECRET` – the secret of the service principal that will be used to authenticate to Azure.
- `GH_TOKEN` – the token that will be used to authenticate to GitHub. This token should have the "repo" scope and should be able to create tags in the repository.
- `GH_USERNAME` – the username of the user that will be used to authenticate to GitHub. This user should have "admin" access to the repository.
- `GIT_PROVIDER` – the name of the Git provider.
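A pipeline script could read these variables from the environment and fail fast when one is missing. A small illustrative sketch (the function is ours; only the variable names come from the list above):

```python
# Sketch: load the CI/CD variables listed above and fail fast on gaps.
import os

REQUIRED = [
    "DATABRICKS_HOST", "REPO_DIRECTORY",
    "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET",
    "GH_TOKEN", "GH_USERNAME", "GIT_PROVIDER",
]

def load_cicd_config(environ=os.environ) -> dict:
    """Return the required variables as a dict, or raise listing what's missing."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}
```

Failing at startup with the full list of missing names is friendlier than a cryptic 401 halfway through a deploy.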
FAQ & Troubleshooting
What does the automated CI/CD pipeline do?
The automation does several things:
- It checks whether the repo already exists in the workspace. If not, it creates it.
- It updates the repo in the workspace to the latest version of the specified branch from the GitHub repo.
- It also creates a new release tag in the repo and updates the production branch to the latest version.
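The check-create-update part of that flow can be sketched as a small function. The `client` here is a hypothetical wrapper over the Databricks Repos API (`GET /api/2.0/repos`, `POST /api/2.0/repos`, `PATCH /api/2.0/repos/{id}`); the fake client only demonstrates the control flow, not the HTTP calls:

```python
# Sketch of the pipeline's repo-sync logic. `client` is a hypothetical
# wrapper over the Databricks Repos API; only the control flow is shown.

def sync_repo(client, git_url, provider, path, branch):
    """Create the repo at `path` if absent, then check out `branch`."""
    existing = client.get_repo_by_path(path)                    # GET
    if existing is None:
        existing = client.create_repo(git_url, provider, path)  # POST
    client.update_repo(existing["id"], branch=branch)           # PATCH
    return existing["id"]

class FakeClient:
    """Stand-in client used to demonstrate the flow without a workspace."""
    def __init__(self):
        self.repos, self.branches = {}, {}
    def get_repo_by_path(self, path):
        return self.repos.get(path)
    def create_repo(self, git_url, provider, path):
        repo = {"id": len(self.repos) + 1, "url": git_url, "path": path}
        self.repos[path] = repo
        return repo
    def update_repo(self, repo_id, branch):
        self.branches[repo_id] = branch

client = FakeClient()
repo_id = sync_repo(
    client, "https://github.com/sivadotblog/databricks-template-repo",
    "gitHub", "/Repos/Dev/databricks-template-repo", "dev",
)
print(repo_id, client.branches[repo_id])   # 1 dev
```

Running `sync_repo` a second time against the same path reuses the existing repo and only moves the branch, which is exactly the idempotent behavior you want from a redeploy.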
I’m getting "Error fetching repo ID for … Unauthorized access to Org…"
This usually happens when you are running the CI/CD pipeline against a Databricks workspace that has IP Access Lists enabled and the CI/CD server's IP is not in the allow list.