Here is the program I am developing in Python:
Step 1 - We receive a JSON file (the size could be in the GBs, e.g. 50 GB or more) from the source onto our server.
Step 2 - I use a Pandas DataFrame to load the JSON into a DF using
```
df = pd.read_json(jsonfile)
```
Step 3 - I use df.to_csv(temp_csvfile, ..) to write the DataFrame out as a CSV.
Step 4 - I use Python psycopg2 to make the PostgreSQL connection and cursor:
```
curr = conn.cursor()
```
Step 5 - Read the CSV and load it using copy_from:
```
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, ..)
conn.commit()
```
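Putting the steps together, the current flow is roughly as below (the connection string, table name and separator are placeholders, not my real values):
```
import pandas as pd
import psycopg2

jsonfile = "input.json"
temp_csvfile = "temp.csv"

# Step 2/3: load the whole JSON into a DataFrame, then dump it as CSV.
df = pd.read_json(jsonfile)
df.to_csv(temp_csvfile, index=False, header=False)

# Step 4: open the PostgreSQL connection and cursor (placeholder DSN).
conn = psycopg2.connect("dbname=mydb user=myuser")
curr = conn.cursor()

# Step 5: stream the CSV into a placeholder table with copy_from.
with open(temp_csvfile, "r") as f:
    curr.copy_from(f, "target_table", sep=",")
conn.commit()
```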
I seek feedback on the points below -
a. Won't loading the JSON into a Pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the JSON file?
b. At Step 5 I am again opening the file in read mode; will the same issue arise here, since it might load the file into memory (am I missing anything)?
c. Is there a better way of doing this?
d. Can Python Dask be used, since it provides reading data in chunks (I am not familiar with it)?
Please advise
CodePudding user response:
You could split your input JSON file into many smaller files, and also use the chunksize parameter (together with lines=True for newline-delimited JSON) while reading the file content into a pandas DataFrame. Also, the psycopg2 copy_from function supports a size parameter that controls its read-buffer size, so it streams from the file rather than loading the whole file at once.
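A rough sketch of that approach, assuming the input is (or can be converted to) newline-delimited JSON; the connection string, file name and target_table below are placeholders:
```
import io
import pandas as pd
import psycopg2

# Placeholder connection string and table name for illustration.
conn = psycopg2.connect("dbname=mydb user=myuser")
curr = conn.cursor()

# chunksize requires lines=True (newline-delimited JSON), so only
# ~100k records are held in memory at a time.
for chunk in pd.read_json("input.jsonl", lines=True, chunksize=100_000):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # size= is copy_from's read-buffer size; it streams from the buffer.
    curr.copy_from(buf, "target_table", sep=",", size=65536)
    conn.commit()

curr.close()
conn.close()
```
Note that copy_from uses PostgreSQL's text COPY format, so embedded commas or newlines in values are not quoted; if the data needs real CSV quoting, cursor.copy_expert with a COPY ... FROM STDIN WITH CSV statement is the safer variant.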
In fact you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
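For instance, a minimal execute_batch sketch; the table and column names are made up for illustration, and the DSN is a placeholder:
```
import pandas as pd
import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
curr = conn.cursor()

# Hypothetical table and column names; adapt to your schema.
insert_sql = "INSERT INTO target_table (col_a, col_b) VALUES (%s, %s)"

for chunk in pd.read_json("input.jsonl", lines=True, chunksize=100_000):
    rows = list(chunk[["col_a", "col_b"]].itertuples(index=False, name=None))
    # execute_batch groups the INSERTs into pages of 1000 rows per round trip.
    execute_batch(curr, insert_sql, rows, page_size=1000)
    conn.commit()

curr.close()
conn.close()
```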
References:
- Loading 20gb json file in pandas
- Loading DataFrame data into a PostgreSQL table (article)
- Read a large json file into pandas