How to update big query data according to mapping file?


I have a mapping file stored in GCS and a BigQuery table.

df_mapping:

Id  Values
1   XZUP
2   SJXC
3   PALD
4   QLOM
5   DKCM

BQ_Table:

Id  Country  Market  Sales  Values
1   Canada   Hsp     2503   XZUP
2   Germany  Noe     2459   SJXC
3   Algeria  Zoe     4635   PALD
4   Brazil   Foe     6354   QLOM
5   Canada   Cmm     2588   XZUP

But the mapping file changes frequently, and I am looking for a solution so that whenever the mapping file changes, the Values column in my BQ table is updated accordingly.

What I did: every time the mapping file changes, I trigger a function that reads the BQ table (except the Values column) and the updated mapping file, left-joins them on the Id column to get the updated Values, deletes my old BQ table, and inserts the new data.

    from google.cloud import bigquery
    import pandas as pd

    # Read the BQ table without the Values column
    query = """
        SELECT
            Id,
            Country,
            Market,
            Sales
        FROM `project.dataset.tbl`
    """

    bqclient = bigquery.Client()
    df = (
        bqclient.query(query)
        .result()
        .to_dataframe(create_bqstorage_client=True)
    )

    # Read the updated mapping file and left-join to get the new Values
    df_mapping = pd.read_csv("gs://path/mapping.csv")
    df_final = pd.merge(df, df_mapping, on="Id", how="left")

    # Not sure how to delete and insert the data safely
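
For the last step, the closest I have found is writing the merged dataframe back with a single load job using WRITE_TRUNCATE, which replaces the table contents in one job instead of delete-then-insert (not sure if this is the right way):

    # Replace the table contents atomically in one load job, so there is
    # no window where the table is deleted but not yet refilled.
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    bqclient.load_table_from_dataframe(
        df_final, "project.dataset.tbl", job_config=job_config
    ).result()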

I am not sure that deleting my old table and inserting a new one is the right approach for this problem.

Problems I see:

  1. After deleting my old table, an error can occur while inserting the new data.
  2. The data is quite large to process (~1 million rows).
  3. The solution is not scalable.
  4. Data can be lost.

Is there another way to do this, for example having the BQ table refer to my mapping file directly? Any other solution to this task would be great.

CodePudding user response:

You could store your mapping reference data somewhere (in GCS in CSV format, in a Google Sheet for easy manual updates, maybe in a Cloud SQL database), and then create a view that joins your data with the mapping.

That way you always have up-to-date data, and you don't need to store the result or manage it (update/delete/re-create).
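
For example, keeping the mapping as a CSV in GCS, a minimal sketch could look like this (the names mapping_ext and tbl_joined are made up; the GCS path and project.dataset.tbl are taken from the question; `Values` is backticked because VALUES is a reserved word in BigQuery SQL):

    from google.cloud import bigquery

    bqclient = bigquery.Client()

    # External table that reads the mapping CSV straight from GCS at query
    # time, plus a view that joins it to the base table on Id.
    ddl = """
    CREATE OR REPLACE EXTERNAL TABLE `project.dataset.mapping_ext` (
        Id INT64,
        `Values` STRING
    )
    OPTIONS (
        format = 'CSV',
        uris = ['gs://path/mapping.csv'],
        skip_leading_rows = 1
    );

    CREATE OR REPLACE VIEW `project.dataset.tbl_joined` AS
    SELECT t.Id, t.Country, t.Market, t.Sales, m.`Values`
    FROM `project.dataset.tbl` AS t
    LEFT JOIN `project.dataset.mapping_ext` AS m USING (Id);
    """
    bqclient.query(ddl).result()

Since the external table reads the file on every query, the view always reflects the latest mapping with no reload step.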

CodePudding user response:

Let's give this a try:

  • Create a scheduled UPDATE query ('on demand') in BigQuery. Here you can expose the GCS mapping file as one BQ table (an external table) and keep your BQ_Table as another. Then simply write an UPDATE query that sets the Values field in BQ_Table per the mapping file.
  • Create a Cloud Function to run the above scheduled query as soon as the mapping data in GCS is updated, as sketched below.
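
A minimal sketch of that function (assuming a first-generation, GCS-triggered Cloud Function and the external table project.dataset.mapping_ext from the previous answer; for simplicity it runs the UPDATE directly rather than invoking a scheduled query):

    from google.cloud import bigquery

    def on_mapping_change(event, context):
        """Triggered by google.storage.object.finalize on the mapping bucket."""
        # React only to the mapping file itself (hypothetical file name).
        if event.get("name") != "mapping.csv":
            return

        bqclient = bigquery.Client()
        # Rewrite Values in BQ_Table from the external mapping table, in place.
        update_sql = """
            UPDATE `project.dataset.tbl` t
            SET `Values` = m.`Values`
            FROM `project.dataset.mapping_ext` m
            WHERE t.Id = m.Id
        """
        bqclient.query(update_sql).result()

Because the UPDATE runs in place, the table is never deleted, which avoids the data-loss window described in the question.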