Using environment-dependent source() functions in DBT

Time:02-19

I have a bit of an odd problem: I need to union together multiple source databases in production, but only one in lower environments. I'm using dbt and would like to use the source functionality so I can trace my data lineage, but I'm not quite sure how to handle this case. Here's the naive non-source approach:

    {% set clouds = [1, 2, 3] %} {# this clouds variable will come from the environment, instead of hard coded. In lower envs, it would just be [1] #}
    {% for cloudId in clouds %}

    select *
    from raw_{{ cloudId }}.users

    {% if not loop.last %}
    union all
    {% endif %}

    {% endfor %}
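For reference, with clouds = [1, 2, 3] the template above renders to plain SQL along these lines:

    select *
    from raw_1.users
    union all
    select *
    from raw_2.users
    union all
    select *
    from raw_3.users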

This isn't ideal, because I'm referencing my raw_n schema(s) directly. I'd love to have something like this:

version: 2

sources:
  - name: raw_1
    schema: raw_1
    database: raw
    tables:
      - name: users
        identifier: users
  - name: raw_2 # this and raw_3 would only exist in prod!
    schema: raw_2
    database: raw
    tables:
      - name: users
        identifier: users

but I'm not sure how to make such a configuration vary by environment. Or can this simply not be done in dbt?

CodePudding user response:

Since source() is just a Python/Jinja function, you can pass variables to it. So the following should work:

    {% set clouds = [1, 2, 3] %} {# this clouds variable will come from the environment, instead of hard coded. In lower envs, it would just be [1] #}
    {% for cloudId in clouds %}

    select *
    from {{ source('raw_' ~ cloudId, 'users') }}

    {% if not loop.last %}
    union all
    {% endif %}

    {% endfor %}

As for the environment part, you would have to use the env_var function. Environment variables are always strings, though, so you would write env_var('my_list').split(','), assuming it's comma-separated.
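Putting those two pieces together, the full model could look something like this sketch. The variable name DBT_CLOUD_IDS is hypothetical; env_var's second argument is its documented default, used here so lower environments fall back to a single cloud:

    {% set clouds = env_var('DBT_CLOUD_IDS', '1').split(',') %} {# hypothetical env var, e.g. "1,2,3" in prod #}
    {% for cloudId in clouds %}

    select *
    from {{ source('raw_' ~ cloudId, 'users') }}

    {% if not loop.last %}
    union all
    {% endif %}

    {% endfor %}

Note that env_var is evaluated when dbt parses the project, so the lineage graph will reflect whatever the variable was set to at parse time.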

CodePudding user response:

Setting context here, I believe your primary interest is in working with the dbt docs / lineage graph for a prod / dev case?

In that case, as you are highlighting, the manifest is generated from the source.yml files within your model directories. So, effectively, what you are asking about is a way to "activate" different versions of a single source.yml file based on environment?

Fair warning: dbt Core's intentions don't align with that use case. So let's explore some alternatives.


  1. If you want to hack something that is dbt-cli / local only, Jeremy lays out that you could approach this via bash/stdout:

The bash-y way

Leveraging the fact that dbt/Jinja can write anything its heart desires to stdout, pipe the stdout contents somewhere else (the generate_source macro here comes from the dbt-codegen package):

    $ dbt run-operation generate_source --args 'schema_name: jaffle_shop'
    > models/staging/jaffle_shop/src_jaffle_shop.yml

At least one reason he points out is that there would be security implications if the dbt Jinja compiler were un-sandboxed from the /target destination, so I wouldn't expect this to change in dbt-core in the future.

  2. A non-dbt helper tool.

From later in the same comment:

At some point, I'm so on board with the request—I'm just not sure if dbt Core can be the right tool for the job, or it's a better fit for a helper tool written in python.

  3. Use git hooks to "switch" the active "source.yml" file in that directory?

This is just an idea that I haven't even looked into, because it's somewhat far-fetched: basically, use your environment variables to activate pre-run hooks that gitignore source-dev.yml in production, and vice versa? The files would have to be defined statically, so I'm not sure that helps anyway.
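One more option worth knowing about, offered as a hedged sketch rather than a confirmed fit for this exact case: newer dbt versions (v1.1+) let you attach an enabled config to a source, and that config value can be rendered from Jinja. Something along these lines could keep raw_2 and raw_3 out of the manifest and lineage graph outside prod (the target name 'prod' is an assumption about your profiles setup):

    version: 2

    sources:
      - name: raw_2
        schema: raw_2
        database: raw
        config:
          enabled: "{{ (target.name == 'prod') | as_bool }}"
        tables:
          - name: users

The catch is that a model selecting from a disabled source fails to compile unless it is disabled too, so this pairs best with an env_var-driven loop that only references the sources that are actually enabled.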
