How to upload an array of strings to BigQuery using python-CodePudding

I am trying to upload a dataframe to bigquery which contains a column that is a list of users.

A typical value in the column would be ['USER1', 'USER2', 'USER3'].

No matter how much I try though when trying to specify the BQ schema, I am never able to manage to get it to understand it is an array of strings.

Here is an example of what I tried for some reference.

bq_schema = [ 
            bigquery.SchemaField("reply_users", field_type='RECORD', fields=[bigquery.SchemaField("user", "STRING", mode="REPEATED")]),
        ]

I've tried and tried but I just cannot work out how to do this.

Any advice greatly appreciated.

CodePudding user response：

With that schema, your dataframe should look like this. See code below:

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

table_id = "project_id.sandbox.nested_from_df"
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("reply_users", field_type='RECORD', fields=[bigquery.SchemaField("user", "STRING", mode="REPEATED")]),
    ],
)
data = {"reply_users": [{ "user": ["user1","user2"]}]}
df = pd.DataFrame(data)
print(df)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()

Output (dataframe):

Loaded data:

CodePudding user response：

Thanks a lot.

I've had another dig through this in the morning and see where I went wrong.

My data is in the format:

data = {"reply_users": ["user1","user2"]}

Given this, the correct format for the schema is :

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("reply_users", field_type='STR', mode='REPEATED')

My issue was that some other function had converted the list in python into a string without me realising so this was just chopping up the string letter by letter.

Appreciate the help though, finally understand how this works !