(Databricks + Delta Tables)
Databricks engineering team & dbt Labs had created dbt-databricks adapter, using dbt-spark package, which allowed integration between dbt core and Databricks. For dbt Cloud, connection to Databricks is available via ODBC/JDBC. But now they have added dbt in Databricks’s Partner Connect which makes the integration easier and seamless. This allows users to use dbt’s data transformation framework to transform data in their Lakehouse.
Previously, connecting dbt Cloud to Databricks required multiple steps, including setting up credentials. With Partner Connect, it securely configures resources and sets up dbt Cloud as well. SQL Endpoint is automatically provisioned. Best practices are already baked into the configuration and Databricks passes the ID & SQL Endpoint automatically to dbt Cloud.
Databricks & dbt Labs have also further optimized open source dbt-databricks adapter, including using Delta tables by default, and merge for incremental models and snapshots. Further addition of features is planned on the roadmap that would leverage Databricks-specific capabilities to power benefits like faster metadata and docs generation via the Databricks catalogue features.
What is dbt?
dbt is a transformation tool in the ELT pipeline that lets teams transform data, following software engineering best practices. It helps data teams to improve transformation process by adding model referencing, lineage, documentation, and testing. Through native support for connectivity to Snowflake, Amazon Redshift, Google BigQuery, and PostgreSQL, dbt runs SQL queries against a database warehouse platform, or query engine to materialize data as tables and views. dbt does the T in ELT (Extract, Load, Transform) process — it doesn’t extract or load data, but only transform.
There are two offering from dbt Labs:
- dbt Core: free open source CLI version
- dbt Cloud: subscription based managed cloud version
The cloud version adds an extra interface layer on top of dbt and takes care of hosting. It also includes managed git repository and can schedule dbt transformation job for different environments.
In dbt, developers model tables, views & CTE by defining transformation in SQL Select statements. Jinja syntax is used in the SQL through which relationships are inferred from the models. This allows dbt to construct Direct Acyclic Graph (DAG), which is visualised in the documentation, providing data lineage information. In addition, this DAG opens the possibility to run data models that are independent in parallel. Additional model properties such as tests and descriptions are defined in YAML files.
dbt compiles the code in SQL and then materializes it in different forms in the target data warehouse or lakehouse. Developers focus on writing transformation statements that return the required results, to be stored in the table. Creating tables, writing inserts, upserts or creating snapshots is all done by dbt.
Connecting to dbt Cloud using Databricks Partner Connect
In Databricks workspace, Partner Connect can be accessed from the left sidebar. It lists all the available Partner integrations in Databricks, including dbt.
Clicking on dbt Tile opens the Connect to Partner window for dbt. Following info is populated:
- Databricks service principal, by default named DBT_CLOUD_USER
- Databricks SQL Endpoint, by default named DBT_CLOUD_ENDPOINT (default name can be changed)
- Login for dbt Cloud, that will host all projects
After providing all the information and once user logins, Partner Connect creates a Databricks personal access token and associates it with the DBT_CLOUD_USER service principal. It gives create-only permission to the DBT_CLOUD_USER service principal, only on the default catalog. If there is a requirement to read tables, then explicit read-only grants need to be given via Data Explorer to DBT_CLOUD_USER user.
After the project has been initialized in dbt Cloud, developers can start developing models. Bronze or raw layer in the Delta Lake can be referred as source tables in dbt models. For example, two table in retail catalog contain the raw data, as viewed from Data Explorer:
Source tables, containing raw data in Delta Lake, can be referred in dbt models, by defining source yml file in the project:
Using the raw data, silver/gold layers, containing staging, dimension, facts and aggregate models, can be defined in dbt.
dbt will determine the dependency for each model and generate the data lineage accordingly.
All models defined in dbt can be materialized in the Lakehouse by executing dbt run command. A Delta table/view is created for each model in the Delta Lake and can be viewed via Databricks Data Explorer.
Testing in dbt Model
In dbt, developers can define test cases for each model & columns within a model, to ensure quality of data. It supports 2 types of test cases:
- Singular tests: SQL query that returns failing rows,
- Generic tests: parametrized query that accepts arguments
It provides 4 generic tests cases out-of-box: uniqueness, null values and accepted values check and, valid foreign key relationships check. Pre-written tests are available in dbt_util package. In Databricks based project, spark_ultils package also needs to be added with dbt_util. Developers can write custom tests in the form of SQL select statements that can be implemented throughout the project.
As an example, following test cases are defined for each model. There are some tests that can be performed at the model level, such as dbt_utils.unique_combination_of_columns, which checks for uniqueness in a combination of columns, handy when a composite key has been defined, and there is a requirement to ensure correctness of granularity of data.
When we execute dbt test command, we can see that SQLs are generated for each test case, a spark connection created, test case SQL is executed using SQL Endpoint and results returned in dbt.
Performance of the queries compiled and issued by dbt to Databricks SQL endpoint can be checked via Query History in Databricks UI. It lists all the queries executed by Databricks. These queries can be filtered by user (DBT_CLOUD_USER) and SQL endpoint (DBT_CLOUD_ENDPOINT). By clicking on a query, users can view the SQL issued by dbt, along with processing stats.
Furthermore, query history also allows users to view the operations performed under the hood by Databricks to return the data.
dbt integration with Databricks is a game changer. Organisations can store all their data in one git-based tool. These transformations can contain the logic to build dimensional models in a data warehouse as well as build silver & gold layers in the Delta Lake. And we can trace data lineage across these different repositories, how good is that! All these transformations are written in SQL, along with test cases for each model to ensure data quality and up to date autogenerate documentation. Adoption of this modern data stack will definitely speed up development of lakehouse by quickly designing sustainable data repositories, containing high quality data.
Muhammad Sunil Bokhari – Head of Data and Analytics