I'm storing in a delta table the prices of products. The schema of the table is like this:
id | price | updated
1 | 3 | 2022-03-21
2 | 4 | 2022-03-20
3 | 3 | 2022-03-20
I upsert rows using the id field as primary key and updating the price and updated field.
I'm trying to have the serie of prices over time using databrick time travel. But looking the documentation apparently I can only look 2 versions of a table like this
%sql
SELECT count(distinct id) - (
SELECT count(distinct id)
FROM table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM table
Is there a way to select the different prices off all version ? Like: Distinct prices.
CodePudding user response:
I would really not recommend to use time travel for that for following reasons:
- If your data is updated frequently, then you will have a lot of versions, and your performance will degrade over the time, as handling of huge number of versions (10s of thousands) will put a lot of pressure on driver
- It's very hard to do historical analysis, as you can see already - for each version you will need to have subqueries and union data.
Instead, you can use two tables - first with actual data, and second - with historical data, ideally, building the SCD Type 2 (Slowly Changing Dimensions) with markers for which period which price was active. You can build that second table using Change Data Feed (CDF) functionality to pull changes from first table, and applying them to the second table using MERGE operation. Databricks documentation includes example of using MERGE to build SCD Type 2 (although without CDF).
With this approach it will be easy for you to perform historical analysis, as all data will be in the same table and you don't need to use time travel