I have a list of dictionaries with more than 100,000 entries.
How can I convert it to JSON and write it to a file that meets BigQuery's requirement of newline-delimited JSON, i.e. like this:
{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}
instead of
[{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}, {"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}]
Notice the difference between the two: the first is newline-separated, one JSON object per line, while the second is a comma-separated array (what normal JSON dumping in Python produces). I need the first one.
What I did before is this, in the last part of the loop:
while condition:
    with open('cache/name.json', 'a') as a:
        json_data = json.dumps(store)
        a.write(json_data + '\n')
Doing this, the file is opened and closed once per element of the list of dictionaries, which makes the loop slow.
How can I write this faster while still following BigQuery's requirement?
CodePudding user response:
This format is called NEWLINE_DELIMITED_JSON, and BigQuery has built-in support for loading it. Assuming the JSON is already in a GCS bucket, here is what you may use:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.json"

load_job = client.load_table_from_uri(
    uri,
    table_id,
    location="US",  # Must match the destination dataset location.
    job_config=job_config,
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)
print("Loaded {} rows.".format(destination_table.num_rows))
CodePudding user response:
Consider looping through the list of dictionaries after the JSON file has been opened in write mode. This way, the file is opened and closed only once, after all the dictionaries have been written. The code below ran quicker than when the file access was inside the while loop, where the file was closed on every iteration.
import json

list_dict = [{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]},
             {"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}]

list_dict = list_dict * 55000  ## Fills the list with 110,000 elements

with open("sample-json-data.json", "w") as jsonwrite:
    for item in list_dict:
        jsonwrite.write(json.dumps(item) + '\n')
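If you want to cut the number of write calls as well, the whole payload can be serialized once and written in a single call. A minimal sketch under the same assumptions (list_dict and the file name as above); note that Python's buffered file object already batches small writes, so the gain over the per-item loop is usually modest:
import json

# Serialize every dictionary and join the lines in one pass.
ndjson = '\n'.join(json.dumps(item) for item in list_dict)

with open("sample-json-data.json", "w") as jsonwrite:
    jsonwrite.write(ndjson + '\n')  # Keep the trailing newline after the last record.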