Data warehouse design (BigQuery): load into dimension tables independently of the fact table

Time:11-01

I want to design a data warehouse (data mart) with one fact table and two dimension tables. The data mart has to handle Slowly Changing Dimensions, using surrogate keys. How can I model this so that rows can be inserted into the dimension tables independently of the fact table, that is, before the corresponding fact rows exist? The data will be streamed from Pub/Sub to BigQuery via Dataflow, so some dimension data may arrive earlier and will need to be inserted into its dimension table before the fact data arrives.

CodePudding user response:

I don't completely understand your question. Dimensions are always (or rather, almost always) populated before fact tables are, since fact table records refer to dimensions (and not the other way around).

If you're worried about being able to destroy and rebuild your dimension table without having to also rebuild your fact table, then you'll need to use some sort of surrogate key pipeline to maintain your surrogate key to natural key relationships. But again, I'm not sure that this is what you're asking.
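To illustrate one way to keep the surrogate-key-to-natural-key relationship stable across rebuilds, here is a minimal sketch that derives the surrogate key deterministically from the natural key. This is an assumption about what such a pipeline could look like, not necessarily what is meant above; the function name `surrogate_key` and the `source_system` parameter are hypothetical. Because the key is a pure function of the natural key, the dimension loader and the fact loader can each compute it without looking the other up.

```python
import hashlib


def surrogate_key(source_system: str, natural_key: str) -> int:
    """Derive a stable surrogate key from a natural key (hypothetical helper).

    The same inputs always give the same key, so a rebuilt dimension table
    ends up with the same keys the fact table already stores.
    """
    raw = f"{source_system}|{natural_key}".encode("utf-8")
    digest = hashlib.sha256(raw).digest()
    # Take 8 bytes and drop one bit so the value is a non-negative integer
    # below 2**63, which fits a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") >> 1


# Both pipelines can call this independently for the same customer.
dim_key = surrogate_key("crm", "customer-42")
fact_key = surrogate_key("crm", "customer-42")
```

Note that this key identifies the dimension member, not a particular Slowly Changing Dimension version; picking the version that was valid for a given fact would still need an effective-date comparison. Hash collisions are possible in principle, so a lookup table that stores the mapping is the more conservative choice.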

CodePudding user response:

BigQuery does not enforce referential integrity: it will not check whether the parent row exists in the dimension table when a child row is inserted into the fact table, and you don't need that check in a data analytics setup. You can keep appending records to the fact table and the dimension tables independently in BigQuery.

One option is to flatten (denormalise) the table and keep the dimension attributes in the fact table. Repeated records are not going to be an issue in BigQuery, and you can make use of clustered and partitioned tables.

Another option, if your dimensions live in an RDBMS: upload the dimension tables as files to Cloud Storage, or as rows to Cloud SQL, and join them in Dataflow. In that case you can skip multiple sinks and write a flattened schema to a single BigQuery table sink.
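The join-then-flatten step could look roughly like the sketch below. It is plain Python rather than a full Dataflow job, and every name in it (`flatten_event`, the `customers` and `products` lookups, the column names) is a made-up example. In an Apache Beam pipeline, a function like this would typically be applied per element, with the dimension lookups passed in as side inputs.

```python
from typing import Any, Dict


def flatten_event(
    event: Dict[str, Any],
    customers: Dict[str, Dict[str, Any]],
    products: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Join one fact event with its dimension rows into a single flat row.

    Missing dimension rows yield None attributes instead of failing,
    so an event is never dropped because its dimension is late.
    """
    customer = customers.get(event["customer_id"], {})
    product = products.get(event["product_id"], {})
    return {
        "order_id": event["order_id"],
        "event_time": event["event_time"],
        "amount": event["amount"],
        "customer_id": event["customer_id"],
        "customer_name": customer.get("name"),
        "product_id": event["product_id"],
        "product_category": product.get("category"),
    }


# Hypothetical dimension data, e.g. loaded from files or a database.
customers = {"c1": {"name": "Alice"}}
products = {"p1": {"category": "books"}}

row = flatten_event(
    {"order_id": "o1", "event_time": "2021-11-01T10:00:00Z",
     "amount": 12.5, "customer_id": "c1", "product_id": "p9"},
    customers,
    products,
)
```

Here `p9` has no dimension row yet, so `product_category` is left empty while the rest of the row is still written.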

Insertion order does not matter in BigQuery; you can relate event records to each other by Pub/Sub message publish time, source event time, etc.
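As a small illustration of why arrival order need not matter, the sketch below picks the dimension version that was valid at a fact's event time, regardless of the order in which the versions were appended. The `valid_from` field and the `version_as_of` helper are assumptions made for this example; in BigQuery the same idea would be expressed as a join condition on timestamps.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, Dict, List, Optional


def version_as_of(
    versions: List[Dict[str, Any]], event_time: datetime
) -> Optional[Dict[str, Any]]:
    """Return the latest version whose valid_from is <= event_time."""
    # Sort by validity start, so the append order is irrelevant.
    ordered = sorted(versions, key=lambda v: v["valid_from"])
    starts = [v["valid_from"] for v in ordered]
    idx = bisect_right(starts, event_time) - 1
    return ordered[idx] if idx >= 0 else None


# Versions appended out of order, as can happen with streaming inserts.
customer_versions = [
    {"valid_from": datetime(2021, 6, 1), "city": "Berlin"},
    {"valid_from": datetime(2021, 1, 1), "city": "Oslo"},
]

match = version_as_of(customer_versions, datetime(2021, 3, 15))
later = version_as_of(customer_versions, datetime(2021, 7, 1))
too_early = version_as_of(customer_versions, datetime(2020, 12, 31))
```

An event dated before the first known version gets no match, which is the case to handle explicitly when a fact arrives before its dimension data.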
