I have a table my_table in BigQuery that is populated on the first of each month. The data looks like this:
date rate
01/01/22 1.5
01/02/22 1.4
...
01/31/22 1.7
On March 1st, 2022 I will get the February data (day by day). But while the February data is not yet available, I need to append a row every day with the current date and null for the rate column, and then overwrite those placeholder rows once the actual data arrives on March 1st.
Desired output is:
date rate
01/01/22 1.5
01/02/22 1.4
...
01/31/22 1.7
02/01/22 null
02/02/22 null
...
02/17/22 null
CodePudding user response:
You can use the following logic to automate your process:
1. On the first of every month, check for the arrival of a new file. If it's there, append its contents to a separate table (call it monthly_table).
2. On a daily basis, run a query which combines the output of monthly_table and my_table and overwrites my_table with the merged result.
For (#1), you can run a command which loads data from your file (assuming it's a CSV) into monthly_table. This can be done with the bq command-line utility, using the --noreplace flag to ensure that the new data is always appended:
bq load \
  --source_format=CSV \
  --skip_leading_rows=2 \
  --noreplace \
  mydataset.monthly_table \
  gs://mybucket/data_for_2022_02_01.csv \
  date:STRING,rate:FLOAT64
For (#2), you can run a query on a daily basis which merges both tables, and then overwrite my_table with the output:
select date, max(rate) as rate from (
  -- data loaded from the monthly files
  select date, rate from mydataset.monthly_table
  union all
  -- what is already in the target table, including earlier null placeholders
  select date, rate from mydataset.my_table
  union all
  -- placeholder row for today, in the same MM/DD/YY format as the stored dates
  select format_date('%m/%d/%y', current_date()) as date, null as rate
)
group by date
order by date
You can also use the bq utility to run this second query, with the --replace=true flag and --destination_table set to mydataset.my_table, so the query output overwrites my_table.
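A minimal sketch of that daily command (dataset and table names assumed from the example above; how you schedule it is up to you):

bq query \
  --use_legacy_sql=false \
  --destination_table=mydataset.my_table \
  --replace \
  'select date, max(rate) as rate from (
    select date, rate from mydataset.monthly_table
    union all
    select date, rate from mydataset.my_table
    union all
    select format_date("%m/%d/%y", current_date()) as date, null as rate
  )
  group by date
  order by date'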