Bigquery Column addition and historical update-CodePudding

We have a bigquery table which holds around 50 GB of logistics data. Recently due to business requirement we had to introduce a new column which works pretty well for the new data but we want to fix the historical data as well for this column.

This personal column is pushed from python and is derived from logistics column which is pipe delimited.

Note :We have data till 2021-01-01 which needs to be updated.

What is the fastest way to achieve this using SQL ?

CodePudding user response：

I think you can use an update query to calculate the Logistics columns for historical data :

update `my_project.my_dataset.my_table`
set Logistics = CONCAT(column1, "|", column2, "|", ....)
where 1 = 1

If your table is partitioned (maybe the column Date), you can add dates ranges in where clause if needed to not process all the table.

For example if historical data is until a date, you are not obliged to execute the update query for all the data in table, and in this case the cost will be cheaper.

CodePudding user response：

As per my understanding, Personal is actually derived from Y or N substring in logistics. If thats so, then you can fetch this substring using regex and later you could populate Personal based on this. As your table is in BigQuery so you can execute following command to do the same.[Kindly change regex as per your requirement]

e.g.  UPDATE `project.dataset.table_name`
      SET Personal = CASE 
                      WHEN REGEXP_EXTRACT(Logistics, r'\|[YN]\|') = '|Y|'
                      THEN 'Yes'
                      ELSE 'No'
                     END
      WHERE DATE <= DATE('2021-01-01')