rfc_1_orchestration
This document proposes a new top-level construct in dbt: the `job`. The `job` block will be responsible for "orchestrating" runs of dbt projects.
Presently, dbt runs like a firecracker: It picks a starting point based on the shape of the graph and runs everything in its path until there's nothing left to run. While dbt projects contain models, tests, hooks, and so on, the order and manner in which these constructs are run is largely dictated by dbt.
dbt projects are unduly restrictive in a few different ways.
First, modifiers like `--full-refresh` and `--non-destructive` are all-or-nothing: they are either applied to all models, or to no models at all. Ideally, these configurations could be applied to one or more models, selectively, depending on the needs of the user. These configurations might differ between dev and prod, or between hourly and nightly runs of dbt, for instance.
Second, arbitrary (non-model) sql is confined to running in a few different places in dbt:
- Before the entire run (`on-run-start`)
- Before a given model (`pre-hook`)
- After a given model (`post-hook`)
- After the entire run (`on-run-end`)
dbt users should be able to inject arbitrary SQL into their dbt runs between individual models, or between subgraphs of models. This might look like vacuuming the tables in the `snowplow` source data schema before a project's `snowplow` models execute in production.
This could also take the shape of inserting records like
(run_id, model_name, start_time, end_time)
into an audit table before and after each model runs, but only in production. The "in production" qualifier here, while minor, makes this task unduly difficult in dbt ~0.9.0. Tasks like this should be simple and straightforward to accomplish in dbt.
Third, "resources" cannot currently be mixed within an invocation of dbt. The commands:
$ dbt seed
$ dbt archive
$ dbt run
$ dbt test
will load seed data and run archives, models, and tests, respectively. Instead of running four different commands, it should be possible to "orchestrate" the execution of these tasks within a single run of dbt. This will make it possible to, for example, run and test your `snowplow` models in a single command. Saving keystrokes is good and noble, but the larger benefit is that complex deployments of dbt become easier to manage.
Finally, dbt's current YAML-based approach to configuration is unwieldy. The `dbt_project.yml` configuration is tied to the folder hierarchy of models on disk. Moving or renaming files can silently break dbt projects. This makes configuring models inside of packages more difficult than it should be. Further, some configuration options accept Jinja code and some do not. This configuration should be at once simpler and more powerful.
While these items are not an exhaustive list of the shortcomings of dbt as it exists today, they do frame the limitations of the existing programming model. The key takeaway: right now dbt runs you, but we think that you should be running dbt.
Models, Tests, Operations, and Archives are all represented internally as "Resources" by dbt. The term "Resource" will be used below to refer to objects like tests, models, operations, and archives, as the following principles are not specific to any one dbt construct.
By introducing a new Jinja block, the `job`, dbt can accomplish everything listed above, and more. These `job` blocks will be responsible for 1) selecting resources, 2) applying configuration, and 3) invoking resources.
Jobs can be defined in any source file in the `source-paths` directory of a dbt project.
Job blocks will look like this:
{% job default %}
.... code ....
{% endjob %}
Job blocks must be named, but require no other configuration. This name allows the job to be invoked from the command line:
$ dbt run default
Or, more simply:
# implicitly run the job named "default"
$ dbt run
This new command-line structure means that top-level dbt commands like `dbt test`, `dbt seed`, and `dbt archive` will go away. Instead, all invocations of dbt will go through the `dbt run` command.
Here are two job blocks: one for `dev` and one for `prod`, for a project built using the `snowplow` package.
----------------------------------------------
-- A simple development job to run all models
----------------------------------------------
{% job dev %}
-- Run all models in the project
{% do _.select("models").run() %}
{% endjob %}
----------------------------------------------
-- A complex production deployment
----------------------------------------------
{% job prod %}
-- Vacuum all of the source tables in the `snowplow` schema (using the `vacuum_tables_in_schema` macro)
{% do vacuum_tables_in_schema('snowplow') %}
-- Reconfigure `snowplow_sessions` to run in full-refresh mode
{% do _.select('models[name=snowplow_sessions]').config({"full_refresh": True}) %}
-- Select all of the models in the Snowplow package
{% set snowplow_models = _.select("models[package=snowplow]") %}
-- Add a post-hook to vacuum any incremental models in the Snowplow package (using the `vacuum_table` macro)
{% do snowplow_models.select('[materialized=incremental]').onComplete(vacuum_table) %}
-- Run all of the snowplow models (and their parents and children)
{% do snowplow_models.run(parents=True, children=True) %}
-- Insert audit records for each of the previously run models using a macro
{% do snowplow_models.onComplete(insert_audit_records) %}
{% endjob %}
Because the `job` block requires users to explicitly invoke resources, dbt must provide a mechanism for selecting resources to run. This selection mechanism must be simple, unambiguous, and comprehensive.
Simple: These selectors should be easy and intuitive to write -- constantly trawling through the docs to find the correct syntax would be unpleasant for dbt users. This selection syntax will also likely make its way into the CLI, so it should be reasonably compact and comprehensible.
Unambiguous: The existing `--models` selection syntax on the dbt command line is ambiguous. The following command can mean three different things:
$ dbt run --models snowplow
- Run a model named `snowplow`
- Run the models in the `models/snowplow` directory
- Run the models in the `snowplow` package
A viable resource selection syntax will be totally unambiguous.
Comprehensive: The selector syntax should make it easy to select common groupings of resources, and possible to select complex groups of resources. There should never be a class of resources that is impossible to select. If the resources can't be selected, then they can't be run!
The resource selection syntax shown in this document is inspired by jQuery and underscore.js. Both of these libraries are used to select, filter, and operate on complex data structures, so they serve as useful starting points for dbt's graph-selection syntax.
When a `job` block is parsed, a variable will be added to the block's context representing the entire set of defined resources in the project. This document uses an underscore (`_`) for the variable name, but it could equivalently be named `dbt` or `graph` or any other valid Python variable name. This variable is a `Selection` object.
`Selection` objects provide a number of useful functions for interacting with a given selection. Each of these functions returns another `Selection` object, so these selectors can be easily chained! This document proposes a single `select` function, but others like `exclude`, `intersect`, `union`, etc. are both possible and compelling.
The `select` function will accept one or more string arguments. Each string argument should be in the format:
'<resource>[<attribute><qualifier><value>, ...]'
- Resource can be one of: {`models`, `tests`, `archives`, `seeds`, `operations`}
- Attribute can be one of:
  - `name`: the name of a model
  - `package`: the name of a package
  - `tags`: the tags attached to a model
  - any configuration option provided to a model, e.g. `materialized`
- Qualifier can be one of: {`=`, `!=`, `*=`}
- Value can be any string
Here are some examples of valid selectors:
# Select a single model
'models[name=snowplow_sessions]'
# Select all of the models in a package
'models[package=snowplow]'
# Select models by an attribute
'models[materialized=table]'
# Select models by multiple attributes
'models[package=snowplow, materialized=incremental]'
# Select all tests
'tests'
# Select all tests containing the tag `base-model`
'tests[tags*=base-model]'
# Select all archives
'archives'
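To make the selector format concrete, here is a minimal Python sketch of how such strings might be parsed. It is not part of the proposal: the names `Condition`, `Selector`, and `parse_selector` are hypothetical, and the sketch assumes that conditions are separated by commas and that values contain no commas.

```python
import re
from typing import NamedTuple, Tuple


class Condition(NamedTuple):
    attribute: str
    qualifier: str  # one of "=", "!=", "*="
    value: str


class Selector(NamedTuple):
    resource: str
    conditions: Tuple[Condition, ...]


# Resource types listed in this document.
RESOURCES = {"models", "tests", "archives", "seeds", "operations"}

# A resource name, optionally followed by a bracketed condition list.
_SELECTOR_RE = re.compile(r"^\s*(\w+)\s*(?:\[(.*)\])?\s*$")
# attribute, qualifier, value; "!=" and "*=" are tried before plain "=".
_CONDITION_RE = re.compile(r"^\s*([\w-]+)\s*(!=|\*=|=)\s*(.+?)\s*$")


def parse_selector(text: str) -> Selector:
    """Parse '<resource>[<attribute><qualifier><value>, ...]' (hypothetical helper)."""
    match = _SELECTOR_RE.match(text)
    if match is None:
        raise ValueError(f"malformed selector: {text!r}")
    resource, body = match.groups()
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource type: {resource!r}")
    conditions = []
    if body is not None and body.strip():
        # Assumption: values never contain commas.
        for part in body.split(","):
            cond = _CONDITION_RE.match(part)
            if cond is None:
                raise ValueError(f"malformed condition: {part.strip()!r}")
            conditions.append(Condition(*cond.groups()))
    return Selector(resource, tuple(conditions))
```

A parser like this rejects anything outside the documented grammar, which is one way to keep the syntax unambiguous: the resource type, attribute, and qualifier are always stated explicitly.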
These selectors can be used to select nodes using the `select` function:
# Select all models in the snowplow package OR materialized as tables
_.select('models[package=snowplow]', 'models[materialized=table]')
Finally, `Selection` objects should provide the following methods:
- `children()`: get all children nodes of the selected nodes
- `parents()`: get all parent nodes of the selected nodes
These methods also return `Selection` objects.
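As an illustration of how graph-aware methods could return chainable selections, here is a small Python sketch. It is only a sketch under stated assumptions: the `Selection` constructor, the `edges` mapping, and the model names other than `snowplow_sessions` are hypothetical, and it assumes "children" and "parents" mean all descendants and all ancestors rather than direct neighbors only.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set


class Selection:
    """An immutable set of node names plus the graph they belong to (sketch)."""

    def __init__(self, edges: Dict[str, Set[str]], names: Iterable[str]):
        self._edges = edges  # parent name -> set of direct child names
        self.names: FrozenSet[str] = frozenset(names)

    def _reachable(self, step: Callable[[str], Iterable[str]]) -> "Selection":
        # Depth-first walk collecting every node reachable via `step`.
        seen: Set[str] = set()
        stack = list(self.names)
        while stack:
            for nxt in step(stack.pop()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return Selection(self._edges, seen)

    def children(self) -> "Selection":
        # Assumption: all descendants, not just direct children.
        return self._reachable(lambda n: self._edges.get(n, ()))

    def parents(self) -> "Selection":
        # Assumption: all ancestors, not just direct parents.
        return self._reachable(
            lambda n: [p for p, kids in self._edges.items() if n in kids]
        )


# Hypothetical graph: base_events feeds two models, one of which feeds a third.
edges = {
    "base_events": {"snowplow_sessions", "snowplow_page_views"},
    "snowplow_sessions": {"snowplow_users"},
}
sessions = Selection(edges, {"snowplow_sessions"})
downstream_of_parents = sessions.parents().children()
```

Because each method returns a new `Selection`, calls like `sessions.parents().children()` chain naturally without mutating the original selection.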
The `Selection` object returned by calls to `select()` will provide a function, `config()`, intended to configure resources. This `config` function will work just like the existing `config()` implementation, with the notable exception that it can be called more than once. Subsequent calls to `config` will override previous configuration settings. Note that `config` is called on a set of resources -- even if that set only contains one element.
# Configure all models to be materialized as tables
_.select('models').config(materialized='table')
# OR:
_.select('models').config({'materialized': 'table'})
# Configure a specific model
_.select('models[name=snowplow_sessions]').config(materialized='table')
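The override behavior described above could be modeled roughly as follows. This Python sketch is an assumption-laden illustration, not the proposed implementation: `ConfigurableSelection` and the shared `configs` dictionary are hypothetical, and it assumes that later calls override earlier ones key by key rather than replacing the whole configuration.

```python
from typing import Dict, Iterable, Optional


class ConfigurableSelection:
    """A set of resource names that writes into a shared config store (sketch)."""

    def __init__(self, configs: Dict[str, dict], names: Iterable[str]):
        self._configs = configs  # resource name -> accumulated config
        self.names = frozenset(names)

    def config(self, options: Optional[dict] = None, **kwargs) -> "ConfigurableSelection":
        # Accept either a dict or keyword arguments, as in the examples above.
        merged = dict(options or {})
        merged.update(kwargs)
        for name in self.names:
            # Assumption: later calls override earlier ones per key.
            self._configs.setdefault(name, {}).update(merged)
        return self  # returning self keeps calls chainable


configs: Dict[str, dict] = {}
all_models = ConfigurableSelection(configs, {"snowplow_sessions", "snowplow_page_views"})
one_model = ConfigurableSelection(configs, {"snowplow_sessions"})

all_models.config(materialized="table")
one_model.config({"materialized": "incremental", "full_refresh": True})
```

After both calls, `snowplow_sessions` carries the more specific settings while `snowplow_page_views` keeps the project-wide default, which is the pattern the job examples rely on.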
If a `Selection` object contains a single model, that model can be "replaced" with another model. This is useful for augmenting models defined in packages, for instance. This syntax looks like:
{% set local_model = _.select('models[name=snowplow_sessions_local, package=internal_analytics]') %}
{% set package_model = _.select('models[name=snowplow_sessions, package=snowplow]') %}
{% do package_model.replace_with(local_model) %}
{% do package_model.run() %}
Or:
{% set local_model = _.find('snowplow_sessions_local') %}
{% set package_model = _.find('snowplow_sessions') %}
{% do package_model.replace_with(local_model) %}
{% do package_model.run() %}
Here, the `find` function is like `select`, except it returns a `Selection` containing a single model uniquely identified by its name.
Hooks can be added to resources using the `onStart` and `onComplete` functions of a `Selection` object. In this way, hooks can be applied to subsets of models, but they can also vary across jobs. This might look like running vacuum hooks in production, but not in development, for instance. The interface for these methods looks like:
Selection.onStart(sql_or_macro)
Selection.onComplete(sql_or_macro)
The `onStart` and `onComplete` functions can be called with either a SQL string or a macro. Macros provided to `onStart` should accept one argument: a resource object. Macros provided to `onComplete` should accept either one or two arguments: a `resource` object (required), and a `result` object (optional). If a macro is used, it should return runnable SQL that will be executed by dbt.
In practice, this code will look like:
_.select('models').onComplete("grant select on table {{ this }} to BI_USER")
Or, better, use a macro:
{% macro grant_model(this) %}
grant select on table {{ this }} to BI_USER;
{% endmacro %}
_.select('models').onComplete(grant_model)
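The "SQL string or macro" contract can be sketched in Python as a single dispatch function. Everything here is hypothetical: `render_hook`, the dictionary shapes for resources and results, and the plain-text substitution of `{{ this }}` (a stand-in for real Jinja rendering) are assumptions made for illustration.

```python
import inspect
from typing import Callable, Optional, Union

Hook = Union[str, Callable[..., str]]


def render_hook(hook: Hook, resource: dict, result: Optional[dict] = None) -> str:
    """Turn a hook (SQL string or macro-like callable) into runnable SQL."""
    if isinstance(hook, str):
        # Stand-in for Jinja rendering: only `{{ this }}` is substituted.
        return hook.replace("{{ this }}", resource["name"])
    # Macros take a resource and, optionally, a result.
    accepts = len(inspect.signature(hook).parameters)
    if result is not None and accepts >= 2:
        return hook(resource, result)
    return hook(resource)


def grant_model(this: dict) -> str:
    # One-argument macro: usable for onStart or onComplete.
    return f"grant select on table {this['name']} to BI_USER"


def insert_audit_record(resource: dict, result: dict) -> str:
    # Two-argument macro: only meaningful for onComplete.
    return (
        "insert into audit (model_name, status) "
        f"values ('{resource['name']}', '{result['status']}')"
    )


model = {"name": "analytics.snowplow_sessions"}
run_result = {"status": "success"}
```

Inspecting the callable's arity lets the same `onComplete` entry point accept both one-argument and two-argument macros, matching the "result is optional" rule described above.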
Finally, arbitrary SQL can be executed using the `sql` function:
sql('grant select on all tables in schema {{ target.schema }} to BI_USER')
In addition to being configurable, `Selection` objects are also executable. All of the resources selected by a `Selection` can be executed using the `.run()` function. The signature for this function looks like:
Selection.run(parents=False, children=False)
This function is supported for all resource types. Whereas calling `run` on a model will execute that model, calling `run` on a test will run the test and report on results.
The `run` function will store the results of the executed resources in memory. These `Result` objects will contain information about the execution of the selected resources, including the execution status and start/end/elapsed time. They can be accessed through `onComplete` hooks, or via a global variable.
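One possible shape for such a result record is sketched below in Python. The field names, the status strings, and the `run_resource` helper are all assumptions made for illustration; the proposal only states that status and start/end/elapsed time are captured.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    """Hypothetical in-memory record of one resource's execution."""
    resource_name: str
    status: str        # assumed values: "success" or "error"
    start_time: float
    end_time: float

    @property
    def elapsed(self) -> float:
        return self.end_time - self.start_time


def run_resource(
    name: str,
    execute: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> Result:
    """Execute one resource and capture its status and timing."""
    start = clock()
    try:
        execute()
        status = "success"
    except Exception:
        # A failure is recorded rather than raised so later hooks can see it.
        status = "error"
    return Result(name, status, start, clock())
```

An `onComplete` macro receiving a record like this would have what it needs to write the `(run_id, model_name, start_time, end_time)` audit rows mentioned earlier, apart from the run identifier.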
While `job`s are the core construct responsible for orchestrating dbt runs, resources can also be configured outside of a job. This is useful for 1) applying configs to groups of models across all `job`s and 2) configuring models contained within packages. This syntax would look like:
{% do _.select('models[package=snowplow]').config({"vars": {"events_table":ref('base_events')}}) %}
{% do _.select('models[package=mailchimp]').config(enabled=False) %}
{% job default %}
{% do _.select('models').run() %}
{% endjob %}
Code inside of the `job` block should be executed lazily. Functions like `Selection.run` or `Selection.onComplete` should translate to a set of instructions to execute, but they should not execute immediately. This will enable features like `dbt run --dry` or `dbt info [model-name]` to statically understand the entirety of the job without needing to actually run it.
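Lazy execution of this kind can be sketched as a recorder: each call appends an instruction to a plan instead of doing any work. The Python below is a minimal sketch under that assumption; `Plan`, `LazySelection`, and the tuple layout of each step are hypothetical names and shapes, not part of the proposal.

```python
from typing import Any, Dict, List, Tuple

Step = Tuple[str, str, Dict[str, Any]]  # (action, selector, options)


class Plan:
    """An ordered list of instructions recorded while a job is evaluated."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add(self, action: str, selector: str, options: Dict[str, Any]) -> None:
        self.steps.append((action, selector, options))


class LazySelection:
    """Records what should happen to a selection; executes nothing itself."""

    def __init__(self, plan: Plan, selector: str) -> None:
        self._plan = plan
        self.selector = selector

    def config(self, **options: Any) -> "LazySelection":
        self._plan.add("config", self.selector, dict(options))
        return self

    def onComplete(self, hook: Any) -> "LazySelection":
        self._plan.add("onComplete", self.selector, {"hook": hook})
        return self

    def run(self, parents: bool = False, children: bool = False) -> "LazySelection":
        self._plan.add("run", self.selector, {"parents": parents, "children": children})
        return self


# Evaluating a job body only fills in the plan.
plan = Plan()
snowplow_models = LazySelection(plan, "models[package=snowplow]")
snowplow_models.config(materialized="table").run(parents=True)
```

A dry run would then print `plan.steps` and stop, while a real run would hand the same list to an executor, so both modes see exactly the same instructions.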