Home > Software design >  Insert only specific columns with DBT
Insert only specific columns with DBT

Time:08-04

As of now, I have an INSERT INTO clause, that works like this:

insert into final_table (name, age, created_at)
select *
from person_table

The final_table has 4 columns: ID (incremental/serial), name, age and created_at. The person_table has only name, age, and created_at.

When I translate it to the dbt logic, it looks like this:

{{
    config(
        materialized = 'incremental'
    )
}}

with person_table as (
  select *
  from person_table
)

select
   name,
   age,
   created_at
from person_table

{% if is_incremental() %}
    AND created_at > (select max(created_at) from {{ this }})
{% endif %}

However, dbt keeps compiling as an INSERT CLAUSE containing the ID (that it is auto-generated). Does anyone know how to insert only the specific columns that I want?

CodePudding user response:

You can use row_number to fake an auto-incrementing key in Snowflake.

In a model materialized as a table, that would be as simple as:

select
   name,
   age,
   created_at,
   row_number() over (order by created_at asc) as id
from person_table

But since this is an incremental model, your inner query will be filtered before row_number() is computed, so we need to add on the maximum value of id in the table if the model is being run in incremental mode:

{{
    config(
        materialized = 'incremental'
    )
}}

{% if is_incremental() %}
    {% set max_id = "(select max(id) from " ~ this ~ ")" %}
{% else %}
    {% set max_id = 0 %}
{% endif %}

with person_table as (
  select *
  from {{ ref('person') }}
)

select
   name,
   age,
   created_at,
   {{ max_id }}   row_number() over (order by created_at asc) as id
from person_table
where 1=1
{% if is_incremental() %}
    AND created_at > (select max(created_at) from {{ this }})
{% endif %}

Note that, if you --full-refresh this model, and records were deleted from person_table, then any records created after the deleted records will have their IDs changed. No real way around that, unfortunately! If you need an immutable ID, I would suggest hashing some combination of name, age, and created_at using md5 or dbt_utils.surrogate_key. Of course, those hashed ID's would not be monotonically increasing.

  • Related