I have a mapping file stored in GCS and a BigQuery table.
df_mapping:

Id  Values
1   XZUP
2   SJXC
3   PALD
4   QLOM
5   DKCM

BQ_Table:

Id  Country  Market  Sales  Values
1   Canada   Hsp     2503   XZUP
2   Germany  Noe     2459   SJXC
3   Algeria  Zoe     4635   PALD
4   Brazil   Foe     6354   QLOM
5   Canada   Cmm     2588   XZUP
But the mapping file changes frequently, and I am looking for a solution: whenever the mapping file changes, the values in my BQ table should also update per the mapping file.
What I did: every time the mapping file changes, I trigger a function that reads the BQ table except the "Values" column, reads the updated mapping file, left joins on the Id column to get the updated "Values", deletes my old BQ table, and inserts the new data:
from google.cloud import bigquery
import pandas as pd

# Read every column except the stale "Values" column.
query = """
SELECT
    Id,
    Country,
    Market,
    Sales
FROM `project.dataset.tbl`
"""
bqclient = bigquery.Client()
df = (
    bqclient.query(query)
    .result()
    .to_dataframe(create_bqstorage_client=True)
)

# Read the updated mapping file and left join to pick up the new "Values".
df_mapping = pd.read_csv("gs://path/mapping.csv")
df_final = pd.merge(df, df_mapping, on='Id', how='left')

# Not sure how to delete the old table and insert the new data safely.
I am not sure that deleting my old table and inserting the new one is the right way to solve this problem.
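Roughly, the replace step I have in mind looks like this (a sketch; delete_table and load_table_from_dataframe are the client methods I would use, and the table name is the same placeholder as above):

# Drop the old table, then reload it from the joined dataframe.
# If the load fails after the delete, the table is simply gone (problem 1 below).
bqclient.delete_table("project.dataset.tbl", not_found_ok=True)

load_job = bqclient.load_table_from_dataframe(df_final, "project.dataset.tbl")
load_job.result()  # block until the load job finishes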
Problems I see:
1. After deleting my old table, an error can occur while inserting the new data, leaving no table at all.
2. The data is quite large to process (~1 million rows).
3. Not a scalable solution.
4. Possible data loss.
Is there another way to do this, something where the BQ table refers to my mapping file directly? Any other solution to this task would be great.
CodePudding user response:
You could store your mapping referential somewhere (in GCS in CSV format, in a Google Sheet for easy manual updates, or in a Cloud SQL database), and then create a view that joins your data and the mapping.
That way, you always have up-to-date data, and you don't need to store the result or manage it (update/delete/re-create).
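For example, here is a minimal sketch of that approach, assuming the mapping stays in GCS as CSV (mapping_ext and tbl_view are made-up names, and the column list comes from the sample tables above):

from google.cloud import bigquery

client = bigquery.Client()

# External table over the CSV in GCS: BigQuery re-reads the file at query
# time, so the view below always reflects the latest upload.
client.query("""
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.mapping_ext` (
  Id INT64,
  `Values` STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://path/mapping.csv'],
  skip_leading_rows = 1
)
""").result()

# View that joins the base table (minus its stale Values column) to the
# always-current mapping.
client.query("""
CREATE OR REPLACE VIEW `project.dataset.tbl_view` AS
SELECT t.Id, t.Country, t.Market, t.Sales, m.`Values`
FROM `project.dataset.tbl` AS t
LEFT JOIN `project.dataset.mapping_ext` AS m
ON t.Id = m.Id
""").result()

Note that the view only changes what readers see; the Values column stored in the base table itself stays as-is.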
CodePudding user response:
Let's give this a try -
- Create a scheduled UPDATE query ('on demand') in BigQuery. Here you can have the GCS mapping file as one BQ table (an external table) and your BQ_Table as another. Simply write an UPDATE query that updates the Values field in BQ_Table per the mapping file (see the sketch after this list).
- Create a Cloud Function to run the above scheduled query as soon as the mapping data in GCS is updated.
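A minimal sketch of both pieces, assuming the external table from the previous answer exists; PROJECT_ID and CONFIG_ID are placeholders for the scheduled query's transfer config, and start_manual_transfer_runs is the Data Transfer API call that fires an 'on demand' scheduled query:

from google.cloud import bigquery_datatransfer_v1
from google.protobuf.timestamp_pb2 import Timestamp
import time

# The statement to save as the 'on demand' scheduled query in BigQuery;
# it rewrites only the Values column in place, joining on Id.
UPDATE_SQL = """
UPDATE `project.dataset.tbl` t
SET `Values` = m.`Values`
FROM `project.dataset.mapping_ext` AS m
WHERE t.Id = m.Id
"""

def on_mapping_upload(event, context):
    """Triggered by google.storage.object.finalize on the bucket that holds
    mapping.csv; starts a manual run of the scheduled query above."""
    client = bigquery_datatransfer_v1.DataTransferServiceClient()
    parent = "projects/PROJECT_ID/locations/us/transferConfigs/CONFIG_ID"
    client.start_manual_transfer_runs(
        request={
            "parent": parent,
            "requested_run_time": Timestamp(seconds=int(time.time())),
        }
    )

If you'd rather skip the scheduled-query layer, the function could instead run UPDATE_SQL directly with bigquery.Client().query(...).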