I have a large JSON file (tens of GB). I read it line by line, but it is slow: it only writes about 10 MB of output per minute. Please help me modify the code. This is my code:
import pandas as pd
import json

add_header = True
with open('1.json') as f_json:
    for line in f_json:
        line = line.strip()
        df = pd.json_normalize(json.loads(line))
        df.to_csv('1.csv', index=None, mode='a', header=add_header)
        add_header = False
I also tried chunked reading, but got an error. The code:
import pandas as pd
import json

data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=100000)
for df in data:
    line = df.strip()
    df = pd.json_normalize(json.loads(line))
    df.to_csv('G:\\1.csv', index=None, mode='a', encoding='utf8')
The output:
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'strip'
Process finished with exit code -1
Here is my JSON file
{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":" 86 18354522200","b_email":"[email protected]","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":" 86 18694982900","b_email":"[email protected]","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
I also tried reading in chunks, with chunksize=1:
import pandas as pd

add_header = True
data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=1)
for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([subdf[['_index', '_id', '_score']], pd.json_normalize(subdf._source)], axis=1)
    temp_df.to_csv('G:\\1.csv', index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False
Output: I converted 2 lines of the JSON file. The first line is converted correctly, but the second line is wrong: the values for the keys '_index', '_id', '_score' and the values from '_source' end up on separate rows.
_index,_id,_score,a_id,a_id_type,a_name,a_name_normal,a_job_title,relationship,b_id,b_id_type,b_name,b_name_normal,b_country_code,b_country,b_in_compliance_db,b_nationality,b_street_address,b_city,b_postcode,b_region,b_phone,b_email,b_latitude,b_longitude,b_geo_accuracy,file_name,b_national_ids.Unified social credit code,b_national_ids.Trade register number,b_national_ids.NOC,dates.date_of_birth,b_geo_point.lat,b_geo_point.lon
core-bvd-dmc,e22762d5c4b81fbcad62b5c1d77226ec,1,P305906272,Contact ID,Mr Chuanzong Chen,MR CHUANZONG CHEN,Executive director and general manager,Currently works for (Executive director and general manager),CN9390051924,BVD ID,"Yantai haofeng trade co., ltd.",YANTAI HAOFENG TRADE CO,CN,China,False,CN,Bei da jie 53hao 1609shi; Zhi fu qu,Yantai,264000,East China|Shandong, 86 18354522200,[email protected],37.511873,121.396883,Community,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,['91370602073035263P'],['370602200112047'],['073035263'],,37.511873,121.396883
,,,P305888975,Contact ID,Mr Hengchao Jiang,MR HENGCHAO JIANG,Legal representative,Currently works for (Legal representative),CN9390053357,BVD ID,"Yantai ji hong educate request information co., ltd.",YANTAI JI HONG EDUCATE REQUEST INFORMATION CO,CN,China,False,CN,Ying chun da jie 131hao nei 1hao; Lai shan qu,Yantai,264000,East China|Shandong, 86 18694982900,[email protected],37.511873,121.396883,Community,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,['597807789'],['370613200023836'],['913706135978077898'],,37.511873,121.396883
core-bvd-dmc,97871f8842398794e380a748f5b82ea5,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
CodePudding user response:
chunksize returns an iterator of DataFrames, so you cannot call strip or json.loads on that. You would probably need something like:
for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    temp_df.to_csv(filename, index=None, mode='a', encoding='utf8')
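Note that to_csv writes the header on every call by default, so when appending you may want to keep the add_header flag from your own code: set add_header = True before the loop and write each chunk with

    temp_df.to_csv(filename, index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False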
You can modify the pd.concat line to flatten/extract the data you want, but I hope you get the idea.
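For example, here is a rough sketch that keeps only a handful of the flattened _source fields (the wanted list below is just an illustration, using field names from your sample):

for subdf in data:
    # flatten the nested _source dicts into columns
    source_df = pd.json_normalize(subdf._source)
    # illustrative subset of columns; replace with whatever you need
    wanted = ['a_id', 'a_name', 'b_id', 'b_name', 'b_country']
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        source_df[wanted]
    ], axis=1)
    temp_df.to_csv(filename, index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False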
Another thought: although CSV holds large data better than JSON, would you consider splitting the output CSV into multiple files instead of creating one huge CSV?
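A rough sketch of that idea, assuming you also raise chunksize again (the chunk size and the 1_part{i} file name pattern are placeholders to tune):

import pandas as pd

data = pd.read_json('G:\\1.json', encoding='utf8', lines=True, chunksize=100000)
for i, subdf in enumerate(data):
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    # one file per chunk, so no single CSV grows without bound
    temp_df.to_csv(f'G:\\1_part{i}.csv', index=None, encoding='utf8')

Writing one file per chunk also means each to_csv call is one large write instead of many tiny appends, which should help with the speed problem you started with.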