Python load huge CSV to PostgreSQL


Here is the program I am developing in Python (a consolidated sketch of all the steps follows after step 5):

Step 1 - We get a JSON file (size could be in GBs, e.g. 50 GB or more) from the source onto our server.

Step 2 - I load the JSON into a pandas DataFrame using `df = pd.read_json(jsonfile)`.

Step 3 - I write the DataFrame to a temporary CSV using `df.to_csv(temp_csvfile, ...)`.

Step 4 - I use psycopg2 to make a PostgreSQL connection and cursor:

```
curr = conn.cursor()
```
Step 5 - Read the CSV and load it using `copy_from`:

```
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, ...)
conn.commit()
```
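
For reference, here is a consolidated, minimal version of the steps above (the file names, table name and connection parameters below are placeholders, not my real values):

```python
import pandas as pd
import psycopg2

jsonfile = "input.json"        # placeholder: the incoming JSON file
temp_csvfile = "temp.csv"      # placeholder: the intermediate CSV

# Step 2: load the entire JSON into a DataFrame (this is what worries me for 50 GB files)
df = pd.read_json(jsonfile)

# Step 3: dump the DataFrame to a temporary CSV without index or header
df.to_csv(temp_csvfile, index=False, header=False)

# Step 4: connect to PostgreSQL (connection details are placeholders)
conn = psycopg2.connect(host="localhost", dbname="mydb", user="myuser", password="secret")
curr = conn.cursor()

# Step 5: bulk-load the CSV with COPY
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, 'my_table', sep=',')
conn.commit()
conn.close()
```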

I seek feedback on the points below:

a. Won't loading the JSON into a pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the size of the JSON file?

b. In step 5 I again open the file in read mode; will the same issue occur there, since it might load the file into memory (am I missing anything here)?

c. Is there a better way of doing this?

d. Can Dask be used here, since it provides reading data in chunks? (I am not familiar with it.)

Please advise

CodePudding user response:

You could split your input JSON file into many smaller files, and also use the chunksize parameter when reading the file content into a pandas DataFrame, so that only one chunk is held in memory at a time. On the loading side, psycopg2's copy_from supports a buffer size parameter, and you can also use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
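
As a rough sketch of that approach (the table name, the newline-delimited JSON assumption, and the chunk/buffer sizes here are illustrative, not taken from your question), you can read the JSON in chunks and stream each chunk to PostgreSQL through an in-memory buffer, so the whole file is never loaded at once:

```python
import io

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection string
curr = conn.cursor()

# chunksize makes read_json return an iterator of DataFrames;
# it requires lines=True, i.e. newline-delimited JSON
for chunk in pd.read_json("input.json", lines=True, chunksize=100_000):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # size= is copy_from's read-buffer size in bytes
    curr.copy_from(buf, "my_table", sep=",", size=65536)
    conn.commit()

conn.close()
```

If you prefer execute_batch() from psycopg2.extras, you can convert each chunk to a list of tuples and pass it with an INSERT statement; COPY is usually faster for plain bulk loads, while execute_batch() lets you use INSERT ... ON CONFLICT if you need upsert behaviour.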

References :

  1. Loading 20gb json file in pandas
  2. Loading dataframes data into postgresql table article
  3. Read a large json file into pandas