Analytics craft

The art of being an analytics practitioner.

Getting Started with git Branching Strategies and dbt

· 31 min read
Christine Berger
Carol Ohms
Taylor Dunlap
Steve Dowling

Hi! We’re Christine and Carol, Resident Architects at dbt Labs. Our day-to-day work is all about helping teams reach their technical and business-driven goals. Collaborating with a broad spectrum of customers ranging from scrappy startups to massive enterprises, we’ve gained valuable experience guiding teams to implement architecture which addresses their major pain points.

The information we’re about to share isn't just from our experiences - we frequently collaborate with other experts like Taylor Dunlap and Steve Dowling who have greatly contributed to the amalgamation of this guidance. Their work lies in being the critical bridge for teams between implementation and business outcomes, ultimately leading teams to align on a comprehensive technical vision through identification of problems and solutions.

Why are we here?
We help teams with dbt architecture, which encompasses the tools, processes and configurations used to start developing and deploying with dbt. There’s a lot of decision making that happens behind the scenes to standardize on these pieces - much of which is informed by understanding what we want the development workflow to look like. The focus on having the perfect workflow often gets teams stuck in heaps of planning and endless conversations, which slows down or even stops momentum on development. If you feel this, we’re hoping our guidance will give you a great sense of comfort in taking steps to unblock development - even when you don’t have everything figured out yet!

There are three major tools that play an important role in dbt development:

  • A repository
    Contains the code we want to change or deploy, along with tools for change management processes.
  • A data platform
    Contains data for our inputs (loaded from other systems) and databases/schemas for our outputs, as well as permission management for data objects.
  • A dbt project
    Helps us manage development and deployment processes of our code to our data platform (and other cool stuff!)
dbt's relationship to git and the data platform

No matter how you end up defining your development workflow, these major steps are always present:

  • Development: How teams make and test changes to code
  • Quality Assurance: How teams ensure changes work and produce expected outputs
  • Promotion: How teams move changes to the next stage
  • Deployment: How teams surface changes to others

This article will be focusing mainly on the topic of git and your repository, how code corresponds to populating your data platform, and the common dbt configurations we implement to make this happen. We’ll also be pinning ourselves to the steps of the development workflow throughout.

Why focus on git?

Source control (and git in particular) is foundational to modern development with or without dbt. It facilitates collaboration between teams of any size and makes it easy to maintain oversight of the code changes in your project. Understanding these controlled processes and what code looks like at each step makes understanding how we need to configure our data platform and dbt much easier.

⭐️ How to “just get started” ⭐️

This article will be talking about git topics in depth — this will be helpful if your team is familiar with some of the options and needs help considering the tradeoffs. If you’re getting started for the first time and don’t have strong opinions, we recommend starting with Direct Promotion.

Direct Promotion is the foundation of all git branching strategies, works well with basic git knowledge, requires the least amount of provisioning, and can easily evolve into another strategy if or when your team needs it. We understand this recommendation can evoke some thoughts of “what if?”. We urge you to think about starting with direct promotion like getting a suit tailored. Your developers can wear it while you’re figuring out the adjustments, and this is a much more informative step forward because it allows us to see how the suit functions in motion — our resulting adjustments can be starkly different from what we thought we’d need when it was static.

The best part about ‘just getting started’ is that it’s not hard to change configurations in dbt for your git strategy later on (and we'll cover this), so don’t think of this as a critical decision that will result in months of broken development while you re-configure if you don’t get it right immediately. Truly, changing your git strategy can be done in a matter of minutes in dbt Cloud.

Branching Strategies

Once a repository has its initial commit, it always starts with one default branch, typically called main or master — we’ll be calling the default branch main in our upcoming examples. The main branch is always the final destination where we aim to land our changes, and most often corresponds to the term "production" - another term you'll see us use throughout.
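
For example, most modern git platforms already default new repositories to main. If an older repository still defaults to master, renaming it is only a couple of commands (a sketch, assuming you have permission to update the remote's default branch):

```bash
# Rename the local default branch and push it to the remote
git branch -m master main
git push -u origin main
# Then switch the default branch in your git platform's settings
```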

The big discussion is how our workflow should look as changes move from development to main. Our process needs to consider all the steps in our workflow: development, quality assurance, promotion, and deployment. Branching Strategies define what this process looks like. We at dbt are not reinventing the wheel - a number of common strategies have already been defined, implemented, iterated on, and tested for at least a decade.

There are two major strategies that encompass all forms of branching strategies: Direct Promotion and Indirect Promotion. We’ll start by laying these two out simply:

  • What is the strategy?
  • How does the development workflow of the strategy look to a team?
  • Which repository branching rules and helpers support this strategy?
  • How do we commonly configure dbt Cloud for this strategy?
  • How do branches and dbt processes map to our data platform with this strategy?

Then, we’ll end by comparing the strategies and covering some frequently asked questions.

Know before you go

There are many ways to configure each tool (especially dbt) to accomplish what you need. The upcoming strategy details were intentionally written to provide what we think are the minimal standards to get teams up and running quickly. These are starter configurations and practices which are easy to tweak and adjust later on. Expanding on these configurations is an exercise left to the reader!

Direct Promotion

Direct promotion means we only keep one long-lived branch in our repository — in our case, main. Here’s the workflow for this strategy:

Direct promotion branching strategy

How does the development workflow look to a team?

Layout:

  • feature is the developer’s unique branch where task-related changes happen
  • main is the branch that contains our “production” version of code

Workflow:

  • Development: I create a feature branch from main to make, test, and personally review changes
  • Quality Assurance: I open a pull request comparing my feature against main, which is then reviewed by peers (required), stakeholders, or subject matter experts (SMEs). We highly recommend including stakeholders or SMEs for feedback during PR in this strategy because the next step changes main.
  • Promotion: After all required approvals and checks, I merge my changes to main
  • Deployment: Others can see and use my changes in main after I merge and main is deployed
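
On the command line, one pass through this loop might look like the following sketch (the branch and model names are hypothetical):

```bash
# Development: branch off the latest main
git checkout main && git pull
git checkout -b feature/add-orders-model

# ...make and test changes, e.g. with dbt build -s +orders...
git add models/marts/orders.sql
git commit -m "Add orders model"

# Quality Assurance: push the branch and open a PR against main
git push -u origin feature/add-orders-model

# Promotion happens by merging the approved PR; Deployment happens
# when the scheduled or merge-triggered job next runs against main.
```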

Repository Branching Rules and Helpers

At a minimum, we like to set up:

  • Branch protection rules on main, requiring pull request approval and passing checks before merge
  • A pull request template to standardize descriptions and review

dbt Cloud Processes and Environments

Here’s our branching strategy again, but now with the dbt Cloud processes we want to incorporate:

Direct Promotion strategy with dbt cloud processes denoted

In order to create the jobs in our diagram, we need dbt Cloud environments. Here are the common configurations for this setup:

| Environment Name | Environment Type | Deployment Type | Base Branch | Will handle… |
| --- | --- | --- | --- | --- |
| Development | development | - | main | Operations done in the IDE (including creating feature branches) |
| Continuous Integration | deployment | General | main | A continuous integration job |
| Production | deployment | Production | main | A deployment job |

Data Platform Organization

Now we need to focus on where we want to build things in our data platform. For that, we need to set our database and schema settings on the environments. Here’s our diagram again, but now mapping how we want our objects to populate from our branches to our data platform:

Direct Promotion strategy with branch relations to data platform objects

Taking the table we created previously for our dbt Cloud environment, let's further map environment configurations to our data platform:

| Environment Name | Database | Schema |
| --- | --- | --- |
| Development | development | User-specified in Profile Settings > Credentials |
| Continuous Integration | development | Any safe default, like dev_ci (it doesn’t even have to exist). The job we intend to set up will override the schema here anyway to denote the unique PR. |
| Production | production | analytics |
note

We are showing environment configurations here, but a default database will be set at the highest level in a connection (which is a required setting of an environment). Deployment environments can override a connection's database setting when needed.
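
The CI job itself typically builds only what a pull request changed (plus downstream models) into that throwaway PR schema. As a rough sketch, a dbt-core equivalent of such a job, assuming production manifest artifacts have been downloaded to ./prod-artifacts, looks like:

```bash
# Build only modified models and their downstream dependents,
# deferring unchanged upstream references to production objects
dbt build --select state:modified+ --defer --state ./prod-artifacts
```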

Direct Promotion Example

In this example, Steve uses the term “QA” to define the environment that builds the changed code from feature branch pull requests. This is equivalent to our ‘Continuous Integration’ environment — this is a great example of defining names which make the most sense for your team!

Indirect Promotion

A note about Indirect Promotion

Indirect Promotion introduces more steps of ownership, so this branching strategy works best when you can identify people who have a great understanding of git to handle branch management. Additionally, the time from development to production is lengthier due to the workload of these new steps, so it requires good project management. We expand more on this later, but it’s an important call out as this is where we see unprepared teams struggle most.

Indirect promotion adds other long-lived branches that derive from main. The simplest version of indirect promotion is a two-trunk hierarchical structure — this is the one we see implemented most commonly in indirect workflows.

Hierarchical promotion means promoting changes back along the same path the branches were derived from. Example:

  • a middle branch is derived from main
  • feature branches derive from the middle branch
  • feature branches merge back to the middle branch
  • the middle branch merges back to main

Some common names for a middle branch as seen in the wild are:

  • qa : Quality Assurance
  • uat : User Acceptance Testing
  • staging or preprod : Common software development terminology

We’ll be calling our middle branch qa throughout the rest of this article.

Here’s the workflow for this strategy:

Indirect Promotion branching strategy

How does the development workflow look to a developer?

Changes from our direct promotion workflow are highlighted in blue.

Layout:

  • feature is the developer’s unique branch where task-related changes happen
  • qa contains approved changes from developers’ feature branches, which will be merged to main and enter production together once additional testing is complete. qa is always ahead of main in changes.
  • main is the branch that contains our “production” version of code

Workflow:

  • Development: I create a feature branch from qa to make, test, and personally review changes
  • Quality Assurance: I open a pull request comparing my feature branch to qa, which is then reviewed by peers and optionally subject matter experts or stakeholders
  • Promotion: After all required approvals and checks, I can merge my changes to qa
  • Quality Assurance: SMEs or other stakeholders can review my changes in qa when I merge my feature
  • Promotion: Once QA specialists give their approval of qa’s version of data, a release manager opens a pull request using qa’s branch targeting main (we define this as a “release”)
  • Deployment: Others can see and use my changes (and other’s changes) in main after qa is merged to main and main is deployed

Repository Branching Rules and Helpers

At a minimum, we like to set up:

  • Branch protection rules on both main and qa, requiring pull request approval and passing checks before merge
  • A pull request template to standardize descriptions and review

dbt Cloud Processes and Environments

Here’s our branching strategy again, but now with the dbt Cloud processes we want to incorporate:

Indirect Promotion strategy with dbt cloud processes denoted

In order to create the jobs in our diagram, we need dbt Cloud environments. Here are the common configurations for this setup:

| Environment Name | Environment Type | Deployment Type | Base Branch | Will handle… |
| --- | --- | --- | --- | --- |
| Development | development | - | qa | Operations done in the IDE (including creating feature branches) |
| Feature CI | deployment | General | qa | A continuous integration job |
| Quality Assurance | deployment | Staging | qa | A deployment job |
| Release CI | deployment | General | main | A continuous integration job |
| Production | deployment | Production | main | A deployment job |

Data Platform Organization

Now we need to focus on where we want to build things in our data platform. For that, we need to set our database and schema settings on the environments. There are two common setups for mapping code, but before we get into those, remember this note from direct promotion:

note

We are showing environment configurations here, but a default database will be set at the highest level in a connection (which is a required setting of an environment). Deployment environments can override a connection's database setting when needed.

  • Configuration 1: A 1:1 mapping of qa and main assets

    In this pattern, the CI schemas are populated in a database outside of Production and QA. This is usually done to keep the databases aligned to what’s been merged on their corresponding branches. Here’s our diagram, now mapping to the data platform with this pattern:

    Indirect Promotion branches and how they relate to 1:1 organization in the data platform

    Here are our configurations for this pattern:

    | Environment Name | Database | Schema |
    | --- | --- | --- |
    | Development | development | User-specified in Profile Settings > Credentials |
    | Feature CI | development | Any safe default, like dev_ci (it doesn’t even have to exist). The job we intend to set up will override the schema here anyway to denote the unique PR. |
    | Quality Assurance | qa | analytics |
    | Release CI | development | A safe default |
    | Production | production | analytics |
  • Configuration 2: A reflection of the workflow initiative

    In this pattern, the CI schemas populate in a qa database because it’s a step in quality assurance. Here’s our diagram, now mapping to the data platform with this pattern:

    Indirect Promotion branches and how they relate to workflow initiative organization in the data platform

    Here are our configurations for this pattern:

    | Environment Name | Database | Schema |
    | --- | --- | --- |
    | Development | development | User-specified in Profile Settings > Credentials |
    | Feature CI | qa | Any safe default, like dev_ci (it doesn’t even have to exist). The job we intend to set up will override the schema here anyway to denote the unique PR. |
    | Quality Assurance | qa | analytics |
    | Release CI | qa | A safe default |
    | Production | production | analytics |

Indirect Promotion Example

In this example, Steve uses the term “UAT” to define the automatic deployment of the middle branch and “QA” to define what’s built from feature branch pull requests. He also defines a database for each (with four databases total - one for development schemas, one for CI schemas, one for middle branch deployments, and one for production deployments) — we wanted to show you this example as it speaks to how configurable these processes are apart from our standard examples.

What did Indirect Promotion change?

You’ve probably noticed one overall theme to adding our additional branch: supporting our Quality Assurance initiative. Let’s break it down:

  • Development

    While no one will be developing in the qa branch itself, it needs the same kind of oversight a feature branch needs in order to stay in sync with its base branch. A change made directly to main (like a hotfix or accidental merge) won’t immediately flag our feature branches, since they are based off of qa's version of code - so qa needs to be kept in sync with any change in main.

  • Quality Assurance

    There are now two places where quality can be reviewed (feature and qa) before changes hit production. qa is typically leveraged in at least one of these ways for more quality assurance work:

    • Testing and reviewing how end-to-end changes are performing over time
    • Deploying the full image of the qa changes to a centralized location. Some common reasons to deploy qa code are:
      • Testing builds from environment-specific data sets (dynamic sources)
      • Creating staging versions of workbooks in your BI tool. This is most relevant when your BI tool doesn’t do well with changing underlying schemas. For instance, some tools have better controls for grabbing a production workbook for development, switching the underlying schema to a dbt_cloud_pr_# schema, and reflecting those changes without breaking things. Other tools will break every column selection you have in your workbook, even if the structure is the same. For this reason, it is sometimes easier to create one “staging” version workbook and always point it to a database built from QA code - the changes then can always be reflected and reviewed from that workbook before the code changes in production.
      • For other folks who want to see or test changes, but aren’t personas that would be included in the review process. For instance, you may have a subject matter expert reviewing and approving alongside developers, who understands the process of looking at dbt_cloud_pr schemas. However, if this person tells their teammates (who will use those changes) that they have just approved some changes with development, the team might ask if there is a way they can also see the changes. Since the CI schema is dropped after merge, they would need to wait to see the change in production if there is no process deploying the middle branch.
  • Promotion

    There are now two places where code needs to be promoted:

    • From feature to qa by a developer and peer (and optionally SMEs or stakeholders)
    • From qa to main by a release manager and SMEs or stakeholders

    Additionally, approved changes from feature branches are promoted together from qa.

  • Deployment

    There are now two major branches code can be deployed from:

    • qa : The “working” version with changes, features merge here
    • main : The “production” version

    Due to our changes collecting on the qa branch, our deployment process changes from continuous deployment (“streaming” changes to main in direct promotion) to continuous delivery (“batched” changes to main). Julia Schottenstein does a great job explaining the differences here.

Comparing Branching Strategies

Since most teams can make direct promotion work, we’ll list some key flags for when we start thinking about indirect promotion with a team:

  • They speak about having a dedicated environment for QA, UAT, staging, or pre-production work.
  • They ask how they can test changes end-to-end and over time before things hit production.
  • Their developers aren’t the same, or the only, folks who are checking data outputs for validity - even more so if those other folks are more familiar with doing this validation work in other tools.
  • Their different environments aren’t working with identical data. Like software environments, they may have limited or scrubbed versions of production data depending on the environment.
  • They have a schedule in mind for making changes “public”, and want to hold features back from being seen or usable until then.
  • They have very high-stakes data consumption.

If you fit any of these, you likely fit into an indirect promotion strategy.

Strengths and Weaknesses

We highly recommend that you choose your branching strategy based on which best supports your workflow needs over any perceived pros and cons — when these are put in the context of your team’s structure and technical skills, you’ll find some aren’t strengths or weaknesses at all!

  • Direct promotion

    Strengths

    • Much faster in terms of seeing changes - once the PR is merged and deployed, the changes are “in production”.
    • Changes don’t get stuck in a middle branch that’s pending the acceptance of someone else’s validation on data output.
    • Management is mainly distributed - every developer owns their own branch and ensures it stays in sync with what’s in main.
    • There are no releases to worry about, so no extra processes to manage.

    Weaknesses

    • It can present challenges for testing changes end-to-end or over time. Our desire to build only modified and directly impacted models to reduce the number of models executed in CI goes against the grain of full end-to-end testing, and our mechanism which executes only upon pull request or new commit won’t help us test over time.
    • It can be more difficult to coordinate review across differing schedules or technical abilities. It’s essential in this strategy to include stakeholders or subject matter experts on pull requests before merge, because the next step is production. Additionally, some tools aren’t great at switching databases and schemas even if the shape of the data is the same. Constant breakage of reports for review can be too much overhead.
    • It can be harder to test configurations or job changes before they hit production, especially if things function a bit differently in development.
    • It can be harder to share code that works fully but isn’t a full reflection of a task. Changes need to be agreed upon to go to production so others can pull them in, otherwise developers need to know how to pull these in from other branches that aren’t main (and be aware of staying in sync or risk merge conflicts).
  • Indirect promotion

    Strengths

    • There’s a dedicated environment to test end-to-end changes over time.
    • Data output can be reviewed either with a developer on PR or once things hit the middle branch.
    • Review from other tools is much easier, because the middle branch tends to deploy to a centralized location. “Staging” reports can be set up to always refer to this location for reviewing changes, and processes for creating new reports can flow from staging to production.
    • Configurations and job changes can be tested with production-like parameters before they actually hit production.
    • There’s a dedicated environment to merge changes if you need them for shared development. Consumers of main will be none the wiser about the things that developers do for ease of collaboration.

    Weaknesses

    • Changes can be slower to get to production due to the extra processes intended for the middle branch. In order to keep things moving, there should be someone (or a group of people) in place who fully own managing the changes, validation status, and release cycle.
    • Changes that are valid can get stuck behind other changes that aren’t - having a good plan in place for how the team should handle this scenario is essential, because this conundrum can hold up getting things to production.
    • There’s extra management of any new trunks, which will need ownership - without someone (or a group of people) who are knowledgeable, it can be confusing to understand what needs to be done and how to do it when things get out of sync.
    • Requires additional compute in the form of scheduled jobs in the qa environment, as well as an additional CI job from qa > main.

Further Enhancements

Once you have your basic configurations in place, you can further tweak your project by considering which other features will be helpful for your needs:

Frequently Asked git Questions

General

How do you prevent developers from changing specific files?

A code owners file can require review from the appropriate owners when certain files or folders are changed, effectively gating those changes behind their approval.

How do you execute other types of checks in the development workflow?

If you’re thinking about auto-formatting or linting code, you can implement this within your dbt project.
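
For example, if your team adopts SQLFluff (one popular option, shown here as an assumption with its dbt templater already configured), linting before a commit can be as simple as:

```bash
# Lint model SQL, then optionally auto-fix style violations
sqlfluff lint models/
sqlfluff fix models/
```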

Other checks are usually implemented through git pipelines (such as GitHub Actions) to run when git events happen (such as checking that a branch name follows a pattern upon a pull request event).
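
As an illustration, the core of such a branch-name check is a small shell test like the sketch below; the naming pattern is hypothetical, and your CI system determines how the branch name is passed in:

```bash
# Fail the pipeline if the branch name doesn't follow the convention
BRANCH_NAME="$1"
if [[ ! "$BRANCH_NAME" =~ ^(feature|fix|hotfix)/[a-z0-9._-]+$ ]]; then
  echo "Branch '$BRANCH_NAME' does not match the naming convention"
  exit 1
fi
```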

How do you revert changes?

This is an action performed outside of dbt through git operations - however, we recommend instead using an immediate solution with git tags/releases until your code is fixed to your liking:

  • Apply a git tag (an available feature on most git platforms) on the commit SHA that you want to roll back to
  • Use the tag as your custom branch on your production environment in dbt Cloud. Your jobs will now check out the code at this point in time.
  • Now you can work as normal. Fix things through the development workflow or have a knowledgeable person revert the changes through git, it doesn’t matter - production is pinned to the previous state until you change the custom branch back to main!
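
The tagging step itself is only a couple of commands (a sketch; the tag name and commit SHA are placeholders):

```bash
# Tag the last known-good commit and publish the tag
git tag last-good-state 1a2b3c4
git push origin last-good-state
# Then set this tag as the custom branch on the Production environment
```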

Indirect Promotion Specific

How do you make releases?

For our examples, a release is just a pull request to get changes into main from qa, opened from the git platform.
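
If your repository lives on GitHub, for instance, a release manager could open that pull request from the command line with the GitHub CLI (a sketch; the title and body are arbitrary):

```bash
# Open a release PR from qa into main
gh pr create --base main --head qa \
  --title "Release: week of 2024-05-06" \
  --body "Batched changes validated in QA"
```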

You should be aware that a pull request with qa as its source branch will also incorporate any new merges to qa made after you opened it, up until it’s merged. Because of this, it’s important that the person opening a release knows what the latest changes were and when a job last ran indicating the success of all the release’s changes. There are two options we like to implement to make this easier:

  • A CI job for pull requests to main - this will catch and rerun our CI job if there are any new commits on our qa branch
  • An on-merge job using our qa environment. This will run a job any time someone merges. You may opt for this if you’d rather not wait on a CI pipeline to finish when you open a release. If this option is used, the latest job that ran should be successful and linked on the release’s PR.

Hierarchical promotion introduces changes that may not be ready for production yet, which holds up releases. How do you manage that?

The process of choosing specific commits to move to another branch is called Cherry Picking.

Cherry Picking diagram

You may be tempted to change to a less standard branching strategy to avoid this - our colleague Grace Goheen has written some thoughts on this and provided examples - it’s a worthwhile read!

dbt does not perform cherry-picking operations; they need to be done from a command-line interface or your git platform’s user interface, if the option is available. We align with Grace on this one — not only does cherry picking require a very good understanding of git operations and the state of the branches, but when it isn’t done with care it introduces a host of other issues that can be hard to resolve. What we tend to see instead is that the CI processes we’ve exemplified shift the definition of the first PR’s approval - not only can the PR be approved for coding and syntax by a peer, but it can also be approved for its output by selecting from objects built within the CI schema. This eliminates a lot of the issues with code that can’t be merged to production.
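
For completeness, here’s roughly what that operation looks like from the command line when done carefully (the commit SHAs are placeholders):

```bash
# Cherry-pick only the approved commits onto a new branch, then PR it,
# since main is usually protected against direct pushes
git checkout main && git pull
git checkout -b release/cherry-picked-fixes
git cherry-pick 1a2b3c4 5d6e7f8
git push -u origin release/cherry-picked-fixes
```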

We also implement other features that can help us omit offending models or introduce more quality:

If you are seeing a need to cherry-pick regularly, assessing your review and quality assurance processes and where they are happening in your pipeline can be very helpful in determining how you can avoid it.

What if a bad change made it all the way into production?

The process of fixing main directly is called a hotfix. This needs to be done with git locally or through your git platform’s user interface, because dbt’s IDE bases developers on the branch you configured for development (in our case, qa).

The pattern for hotfixes in hierarchical promotion looks like this:

Hotfix diagram

Here’s how it’s typically performed:

  1. Create a branch from main, test and review the fix
  2. Open a PR to main, get the fix approved, then merge. The fix is now live.
  3. Check out qa, and git pull to ensure it’s up to date with what’s on the remote
  4. Merge main into qa: git merge main
  5. git push the changes back to the remote
  6. At this point in our example, developers will be flagged in dbt Cloud’s IDE that there is a change on their base branch and can “Pull from remote”. However, if you implement more than one middle branch you will need to continue resolving your branches hierarchically until you update the branch that developers base from.
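
Condensed into commands, steps 3 through 5 look like this (assuming your remote is named origin):

```bash
# Bring the hotfix on main back down into the middle branch
git checkout qa
git pull origin qa
git merge main
git push origin qa
```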

What if we want to use more than one middle branch in our strategy?

In our experience, using more than one middle branch is rarely needed. The more steps you are away from main, the more hurdles you’ll need to jump through to get back to it. If your team isn’t properly equipped, this ends up putting a lot of overhead on development operations. For this reason, we don’t recommend more branches if you can help it. The teams who are successful with more trunks have plenty of folks who can properly dedicate time and management to these processes.

A git strategy with more branches

This structure is mostly desired when there are requirements for different teams to use different versions of data (e.g., scrubbed data) while working with the same code changes. This structure allows each team to have a dedicated environment for deployments. Example:

  1. Developers work off of mocked data for their feature branches and merge to qa for end-to-end and over-time testing of all merged changes before releasing to preproduction.
  2. Once qa is merged to preproduction, the underlying data being used switches to using scrubbed production data and other personas can start looking at and reviewing how this data is functioning before it hits production.
  3. Once preproduction is merged to main, the underlying data being used switches to production data sets.

This use case can be covered with a simpler branching strategy through the use of git tags and dbt environment variables to switch source data:

  • Indirect Promotion:

    Tagging in Indirect Promotion
  • Direct Promotion:

    Tagging in Direct Promotion
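
The environment-variable half of that approach relies on dbt’s env_var() function inside your source definitions, so identical code reads different data per environment. A hypothetical invocation (the variable name is illustrative):

```bash
# sources.yml resolves the database via env_var('DBT_SOURCE_DATABASE'),
# so a QA job can read scrubbed data while production reads the real thing
DBT_SOURCE_DATABASE=scrubbed_raw dbt build
```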

No matter the reason for more branches, these points are always relevant to plan out:

  • Can we accurately describe the use case of each branch?
  • Who owns the oversight of any new branches?
  • Who are the major players in the promotion process between each branch and what are they responsible for?
  • Which major branches do we want dbt Cloud deployment jobs for?
  • Which PR stages do we want continuous integration jobs on?
  • Which major branch rules or PR templates do we need to add?

By answering these questions, you should be able to follow our same guidance from our examples for setting up your additional branches.

Direct Promotion Specific

We need a middle environment and don’t want to change our branching strategy! Is there any way to reflect what’s in development?

git releases/tags are a mechanism that helps you label a specific commit SHA. Deployment environments in dbt Cloud can use these just like they can a custom branch. Teams leverage this either to pin their environments to code at a certain point in time or to keep a roll-back option on hand if needed.

We can use the pinning method to create our middle environment. Example:

  • We create a release tag, v2, from our repository.
  • We specify v2 as our branch in our Production environment’s custom branch setting. Jobs using Production will now check out code at v2.
  • We set up an environment called “QA”, with the custom branch setting as main. For the database and schema, we specify the qa database and analytics schema. Jobs created using this environment will check out code from main and build it into qa.analytics.
Tagging in Direct Promotion to create a middle environment

How do we change from a direct promotion strategy to an indirect promotion strategy?

Here are the additional setup steps in a nutshell - for more details, be sure to read through the indirect promotion section:

  • git Platform
    • Create a new branch derived from main for your middle branch (a command sketch follows this list).
    • Protect the branch with branch protection rules.
  • dbt Cloud
    • Development: Switch your environment to use the custom branch option and specify your new middle branch’s name. This will base developers off of the middle branch.
    • Continuous Integration: If you have an existing environment for this, ensure the custom branch is also changed to the middle branch’s name. This will change the CI job’s trigger to occur on pull requests to your middle branch.
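
Creating and publishing the middle branch from the first step above might look like:

```bash
# Derive the middle branch from the current state of main and publish it
git checkout main && git pull
git checkout -b qa
git push -u origin qa
# Then add branch protection rules for qa in your git platform's settings
```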

At this point, your developers will be following the indirect promotion workflow and you can continue working on things in the background. You may still need to set up a database, database permissions, environments, deployment jobs, etc. Here is a short checklist to help you out! Refer back to our section on indirect promotion for many more details:

  • Decide if you want to deploy your middle branch’s code. If so:
    • If needed, create the database where the objects will build

    • Set up a service account and give it all the proper permissions. For example, if deployments will build in a dedicated database, the service account should have full access to create and modify the contents within this database. It should also have select-only access to raw data.

    • Set up an environment for your middle branch in dbt Cloud, being sure to connect it to the location you want your deployments to build in.

    • Set up any deployment jobs using your middle branch’s environment

  • Decide if you want CI on release pull requests (from your middle branch to main). If so:
    • Set up an environment called “Release CI”
    • Set up the continuous integration job using the “Release CI” environment

Why I wish I had a control plane for my renovation

· 4 min read
Mark Wan

When my wife and I renovated our home, we chose to take on the role of owner-builder. It was a bold (and mostly naive) decision, but we wanted control over every aspect of the project. What we didn’t realize was just how complex and exhausting managing so many moving parts would be.

My wife pondering our sanity

We had to coordinate multiple elements:

  • The architects, who designed the layout, interior, and exterior.
  • The architectural plans, which outlined what the house should look like.
  • The builders, who executed those plans.
  • The inspectors, councils, and energy raters, who checked whether everything met the required standards.

Test smarter not harder: Where should tests go in your pipeline?

· 8 min read
Faith McKenna
Jerrie Kumalah Kenney

👋 Greetings, dbt’ers! It’s Faith & Jerrie, back again to offer tactical advice on where to put tests in your pipeline.

In our first post on refining testing best practices, we developed a prioritized list of data quality concerns. We also documented first steps for debugging each concern. This post will guide you on where specific tests should go in your data pipeline.

Note that we are constructing this guidance based on how we structure data at dbt Labs. You may use a different modeling approach—that’s okay! Translate our guidance to your data’s shape, and let us know in the comments section what modifications you made.

First, here are our opinions on where specific tests should go:

  • Source tests should be fixable data quality concerns. See the callout box below for what we mean by “fixable”.
  • Staging tests should be business-focused anomalies specific to individual tables, such as accepted ranges or ensuring sequential values. In addition to these tests, your staging layer should clean up any nulls, duplicates, or outliers that you can’t fix in your source system. You generally don’t need to test your cleanup efforts.
  • Intermediate and marts layer tests should be business-focused anomalies resulting specifically from joins or calculations. You also may consider adding additional primary key and not null tests on columns where it’s especially important to protect the grain.

Test smarter not harder: add the right tests to your dbt project

· 11 min read
Faith McKenna
Jerrie Kumalah Kenney

The Analytics Development Lifecycle (ADLC) is a workflow for improving data maturity and velocity, and testing is a key phase of it. Many dbt developers tend to focus on primary keys and source freshness; we think there is a more holistic and in-depth path to tread, one where testing drives data quality.

In this blog, we’ll walk through a plan to define data quality. This will look like:

  • identifying data hygiene issues
  • identifying business-focused anomaly issues
  • identifying stats-focused anomaly issues

Once we have defined data quality, we’ll move on to prioritize those concerns. We will:

  • think through each concern in terms of the breadth of impact
  • decide if each concern should be at error or warning severity

How Hybrid Mesh unlocks dbt collaboration at scale

· 7 min read
Jason Ganz

One of the most important things that dbt does is unlock the ability for teams to collaborate on creating and disseminating organizational knowledge.

In the past, this primarily looked like a team working in one dbt Project to create a set of transformed objects in their data platform.

As dbt was adopted by larger organizations and began to drive workloads at a global scale, it became clear that we needed mechanisms to allow teams to operate independently from each other, creating and sharing data models across teams — dbt Mesh.

How to build a Semantic Layer in pieces: step-by-step for busy analytics engineers

· 10 min read

The dbt Semantic Layer is founded on the idea that data transformation should be both flexible, allowing for on-the-fly aggregations grouped and filtered by definable dimensions, and version-controlled and tested. Like any other codebase, you should have confidence that your transformations express your organization’s business logic correctly. Historically, you had to choose between these options, but the dbt Semantic Layer brings them together. This has required new paradigms for how you express your transformations, though.

Putting Your DAG on the internet

· 5 min read
Ernesto Ongaro
Sebastian Stan
Filip Byrén

New in dbt: allow Snowflake Python models to access the internet

With dbt 1.8, dbt released support for Snowflake’s external access integrations, further enabling the use of dbt + AI to enrich your data. This allows querying of external APIs within dbt Python models, a functionality that was required by dbt Cloud customer EQT AB. Learn about why they needed it and how they helped build the feature and get it shipped!

Unit testing in dbt for test-driven development

· 9 min read
Doug Beatty

Do you ever have "bad data" dreams? Or am I the only one that has recurring nightmares? 😱

Here's the one I had last night:

It began with a midnight bug hunt. A menacing insect creature has locked my colleagues in a dungeon, and they are pleading for my help to escape. Finding the key is elusive and always seems just beyond my grasp. The stress is palpable, a physical weight on my chest, as I race against time to unlock them.

Of course I wake up without actually having saved them, but I am relieved nonetheless. And I've had similar nightmares involving a heroic code refactor or the launch of a new model or feature.

Good news: beginning in dbt v1.8, we're introducing a first-class unit testing framework that can handle each of the scenarios from my data nightmares.

Before we dive into the details, let's take a quick look at how we got here.

Maximum override: Configuring unique connections in dbt Cloud

· 6 min read

dbt Cloud now includes a suite of new features that enable configuring precise and unique connections to data platforms at the environment and user level. These enable more sophisticated setups, like connecting a project to multiple warehouse accounts, first-class support for staging environments, and user-level overrides for specific dbt versions. This gives dbt Cloud developers the features they need to tackle more complex tasks, like Write-Audit-Publish (WAP) workflows and safely testing dbt version upgrades. While you still configure a default connection at the project level and per-developer, you now have tools to get more advanced in a secure way. Soon, dbt Cloud will take this even further, allowing multiple connections to be set globally and reused via global connections.

LLM-powered Analytics Engineering: How we're using AI inside of our dbt project, today, with no new tools.

· 10 min read
Joel Labes

Cloud Data Platforms make new things possible; dbt helps you put them into production

The original paradigm shift that enabled dbt to exist and be useful was databases going to the cloud.

All of a sudden it was possible for more people to do better data work as huge blockers became huge opportunities:

  • We could now dynamically scale compute on-demand, without upgrading to a larger on-prem database.
  • We could now store and query enormous datasets like clickstream data, without pre-aggregating and transforming it.

Today, the next wave of innovation is happening in AI and LLMs, and it's coming to the cloud data platforms dbt practitioners are already using every day. For one example, Snowflake has just released their Cortex functions to access LLM-powered tools tuned for running common tasks against your existing datasets. In doing so, a new set of opportunities becomes available to us:

Column-Level Lineage, Model Performance, and Recommendations: ship trusted data products with dbt Explorer

· 9 min read
Dave Connors

What’s in a data platform?

Raising a dbt project is hard work. We, as data professionals, have poured ourselves into raising happy healthy data products, and we should be proud of the insights they’ve driven. It certainly wasn’t without its challenges though — we remember the terrible twos, where we worked hard to just get the platform to walk straight. We remember the angsty teenage years where tests kept failing, seemingly just to spite us. A lot of blood, sweat, and tears are shed in the service of clean data!

Once the project could dress and feed itself, we also worked hard to get buy-in from our colleagues who put their trust in our little project. Without deep trust and understanding of what we built, our colleagues who depend on our data (or even those involved in developing it with us — it takes a village after all!) are more likely to be in our DMs with questions than in their BI tools, generating insights.

When our teammates ask about where the data in their reports comes from, how fresh it is, or about the right calculation for a metric, what a joy! This means they want to put what we’ve built to good use — the challenge is that, historically, it hasn’t been all that easy to answer these questions well. That has often meant a manual, painstaking process of cross-checking run logs and your dbt documentation site to get the stakeholder the information they need.

Enter dbt Explorer! dbt Explorer centralizes documentation, lineage, and execution metadata to reduce the work required to ship trusted data products faster.

More time coding, less time waiting: Mastering defer in dbt

· 9 min read
Dave Connors

Picture this — you’ve got a massive dbt project, thousands of models chugging along, creating actionable insights for your stakeholders. A ticket comes your way — a model needs to be refactored! "No problem," you think to yourself, "I will simply make that change and test it locally!" You look at your lineage, and realize this model is many layers deep, buried underneath a long chain of tables and views.

“OK,” you think further, “I’ll just run a dbt build -s +my_changed_model to make sure I have everything I need built into my dev schema and I can test my changes”. You run the command. You wait. You wait some more. You get some coffee, and completely take yourself out of your dbt development flow state. A lot of time and money down the drain to get to a point where you can start your work. That’s no good!

Luckily, dbt’s defer functionality allows you to build only what you care about when you need it, and nothing more. This feature helps developers spend less time and money in development, helping ship trusted data products faster. dbt Cloud offers native support for this workflow in development, so you can start deferring without any additional overhead!
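
As a taste of what this looks like with dbt-core (a sketch, assuming production run artifacts have been downloaded to ./prod-artifacts):

```bash
# Build only the changed model, resolving its unbuilt upstream refs
# against production instead of rebuilding them in dev
dbt build --select my_changed_model --defer --state ./prod-artifacts
```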

To defer or to clone, that is the question

· 6 min read
Kshitij Aranke
Doug Beatty

Hi all, I’m Kshitij, a senior software engineer on the Core team at dbt Labs. One of the coolest moments of my career here thus far has been shipping the new dbt clone command as part of the dbt-core v1.6 release.

However, one of the questions I’ve received most frequently is guidance around “when” to clone that goes beyond the documentation on “how” to clone. In this blog post, I’ll attempt to provide this guidance by answering these FAQs:

  1. What is dbt clone?
  2. How is it different from deferral?
  3. Should I defer or should I clone?
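
For reference, a minimal clone invocation looks like the sketch below (the selection and state path are illustrative):

```bash
# Create clones of the selected production relations in your dev schema
dbt clone --select my_model+ --state ./prod-artifacts
```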

Optimizing Materialized Views with dbt

· 11 min read
Amy Chen
note

This blog post was updated on December 18, 2023 to cover the support of MVs on dbt-bigquery and updates on how to test MVs.

Introduction

The year was 2020. I was a kitten-only household, and dbt Labs was still Fishtown Analytics. An enterprise customer I was working with, Jetblue, asked me for help running their dbt models every 2 minutes to meet a 5-minute SLA.

After getting over the initial terror, we talked through the use case and soon realized there was a better option. Together with my team, I created lambda views to meet the need.

Flash forward to 2023. I’m writing this as my giant dog snores next to me (don’t worry the cats have multiplied as well). Jetblue has outgrown lambda views due to performance constraints (a view can only be so performant) and we are at another milestone in dbt’s journey to support streaming. What. a. time.

Today we are announcing that we now support Materialized Views in dbt. So, what does that mean?

Create dbt Documentation and Tests 10x faster with ChatGPT

· 8 min read
Pedro Brito de Sa

Whether you are bringing your pipelines into dbt for the first time or just adding a new model once in a while, good documentation and testing should always be a priority for you and your team. Why do we avoid it like the plague then? Because it’s a hassle having to write down each individual field and its description in layman’s terms, and to figure out what tests should be performed to ensure the data is fine and dandy. How can we make this process faster and less painful?

By now, everyone knows the wonders of the GPT models for code generation and pair programming so this shouldn’t come as a surprise. But ChatGPT really shines at inferring the context of verbosely named fields from database table schemas. So in this post I am going to help you 10x your documentation and testing speed by using ChatGPT to do most of the leg work for you.

Data Vault 2.0 with dbt Cloud

· 15 min read
Rastislav Zdechovan
Sean McIntyre

Data Vault 2.0 is a data modeling technique designed to help scale large data warehousing projects. It is a rigid, prescriptive system detailed vigorously in a book that has become the bible for this technique.

So why Data Vault? Have you experienced a data warehousing project with 50+ data sources, with 25+ data developers working on the same data platform, or data spanning 5+ years with two or more generations of source systems? If not, it might be hard to initially understand the benefits of Data Vault, and maybe Kimball modelling is better for you. But if you are in any of the situations listed, then this is the article for you!

Building a historical user segmentation model with dbt

· 14 min read
Santiago Jauregui

Introduction

Most data modeling approaches for customer segmentation are based on a wide table with user attributes. This table only stores the current attributes for each user, and is then loaded into the various SaaS platforms via Reverse ETL tools.

Take for example a Customer Experience (CX) team that uses Salesforce as a CRM. The users will create tickets to ask for assistance, and the CX team will start attending to them in the order that they are created. This is a good first approach, but not a data-driven one.

An improvement to this would be to prioritize the tickets based on the customer segment, answering our most valuable customers first. An Analytics Engineer can build a segmentation to identify the power users (for example with an RFM approach) and store it in the data warehouse. The Data Engineering team can then export that user attribute to the CRM, allowing the customer experience team to build rules on top of it.

Modeling ragged time-varying hierarchies

· 18 min read
Sterling Paramore

This article covers an approach to handling time-varying ragged hierarchies in a dimensional model. These kinds of data structures are commonly found in manufacturing, where components of a product have both parents and children of arbitrary depth and those components may be replaced over the product's lifetime. The strategy described here simplifies many common types of analytical and reporting queries.

To help visualize this data, we're going to pretend we are a company that manufactures and rents out eBikes in a ride share application. When we build a bike, we keep track of the serial numbers of the components that make up the bike. Any time something breaks and needs to be replaced, we track the old parts that were removed and the new parts that were installed. We also precisely track the mileage accumulated on each of our bikes. Our primary analytical goal is to be able to report on the expected lifetime of each component, so we can prioritize improving that component and reduce costly maintenance.

How we reduced a 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake

· 12 min read
Arthur Marcon
Lucas Bergo Dias
Christian van Bellen

Alteryx is a visual data transformation platform with a user-friendly interface and drag-and-drop tools. Nonetheless, Alteryx may struggle to cope with increasing complexity in an organization’s data pipeline, and it can become a suboptimal tool when companies start dealing with large and complex data transformations. In such cases, moving to dbt can be a natural step, since dbt is designed to manage complex data transformation pipelines in a scalable, efficient, and more explicit manner. In this case, the transition also involved migrating from on-premises SQL Server to Snowflake cloud computing. In this article, we describe the differences between Alteryx and dbt, and how we reduced a client’s 6-hour runtime in Alteryx to 9 minutes with dbt and Snowflake at Indicium Tech.

Building a Kimball dimensional model with dbt

· 20 min read
Jonathan Neo

Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT) to name a few.

Data modeling techniques on a normalization vs denormalization scale

While the relevance of dimensional modeling has been debated by data practitioners, it is still one of the most widely adopted data modeling techniques for analytics.

Despite its popularity, resources on how to create dimensional models using dbt remain scarce and lack detail. This tutorial aims to solve this by providing the definitive guide to dimensional modeling with dbt.

By the end of this tutorial, you will:

  • Understand dimensional modeling concepts
  • Set up a mock dbt project and database
  • Identify the business process to model
  • Identify the fact and dimension tables
  • Create the dimension tables
  • Create the fact table
  • Document the dimensional model relationships
  • Consume the dimensional model