I have a table in bigquery which has some basic data segmented by date (main_table). I then have a new table which has the most recent 5 days worth of data (daily_transfer_table).
On a daily basis I'm looking to delete the last 5 days worth of data from my main_table, before inserting the most recent 5 days from the daily_transfer_table.
The reason for this is that there could be an issue with recent data so I always want to refresh the last 5 days, rather than just yesterday.
I'm not sure if this can be done with delete and insert or if I should be using another structure?
Many thanks for any help.
Mark
CodePudding user response:
You should try the merge statement. From what I understand, this example example of merging two tables is exactly what you need
CodePudding user response:
There is probably a couple of options here, consider the following approaches:
- As OjtamOjtam said, perform a merge. Depending on how much data there is though this might be more overhead than it is worth.
- Perform a truncate prior to the insert, this would delete are the rows but keep the schema intact. This is probably the simplest approach, but if the schema changes may not be your best approach.
- Delete the table as a whole and recreate. Very similar to the truncate with the big difference being you will need to recreate your table. This may be beneficial if you think the schema may change.
- Partition your table by ingestion time and constantly insert, potentially with a data expiration policy. This would allow you to be able to query the table very easily, just filtering based on the partition date and allow your data expiration policy to clear out old data. Again this would suffer from schema evolution but could be very effective for you.
Some helpful links: