Home > front end >  How to deal with data structure changes when performing a full historical data load?
How to deal with data structure changes when performing a full historical data load?

Time:01-01

I'm dealing with a SQL Server database which contains a column "defined data" with JSON data in it (and some other simple columns). The data builds up over time, right now we have about 8 million rows.

The data from this db is periodically read by an ETL system which then reads the JSON data in the "defined data" column and maps the data to a new SQL Server table based on the columns names contained in the JSON data.

This SQL Server table is prone to changes, meaning that about every 4 months additional columns are needed or column names change. Whenever this SQL Server table changes its data structure, a new version is introduced, which also forces the JSON data structure to change.

However, the ETL system should still be able to load all historical (JSON) data from the SQL Server database, regardless of the changing version throughout time. How can I make this work, taking into consideration version changes of the SQL Server tables and the JSON data?

!example]1

So in this example my question is:

How can I ensure that I can load both client 20 and 21 into one SQL Server table without getting errors because the JSON data structure is not reflecting version 2 in the case of historical data?

Given the size of the SQL Server database, it doesn't seem like an option to update all historical JSON data according to the latest version (in this example that would mean adding "AssetType" for the 01-01-2021 data and filling it in with NULL).

Many, many thanks in advance!

CodePudding user response:

First I would check if json fields exist in the table as column names by looking them up in the information schema. If not exists then alter table add column.

CodePudding user response:

How can I ensure that I can load both client 20 and 21 into one SQL Server table without getting errors because the JSON data structure is not reflecting version 2 in the case of historical data?

You maintain 2 separate tables. A Raw/Staging/Bronze table that has the same schema as the source, and a Cleansed/Warehouse/Silver table that has the desired schema for reporting. If you have multiple separate sources, you may have separate Raw tables.

Periodically you enhance the schema of the Cleansed table to add new data that has appeared in the Raw table.

  • Related