I have a large JSON file (tens of GB). I read it line by line, but it is slow: it only writes about 10 MB of output per minute. Please help me modify the code. This is my code:
import pandas as pd
import json

add_header = True
with open('1.json') as f_json:
    for line in f_json:
        line = line.strip()
        df = pd.json_normalize(json.loads(line))
        df.to_csv('1.csv', index=None, mode='a', header=add_header)
        add_header = False
I also tried chunked reading, but got an error. The code:
import pandas as pd
import json

data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=100000)
for df in data:
    line = df.strip()
    df = pd.json_normalize(json.loads(line))
    df.to_csv('G:\\1.csv', index=None, mode='a', encoding='utf8')
The output:
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'strip'
Process finished with exit code -1
Here is my JSON file
{"_index":"core-bvd-dmc","_type":"_doc","_id":"e22762d5c4b81fbcad62b5c1d77226ec","_score":1,"_source":{"a_id":"P305906272","a_id_type":"Contact ID","a_name":"Mr Chuanzong Chen","a_name_normal":"MR CHUANZONG CHEN","a_job_title":"Executive director and general manager","relationship":"Currently works for (Executive director and general manager)","b_id":"CN9390051924","b_id_type":"BVD ID","b_name":"Yantai haofeng trade co., ltd.","b_name_normal":"YANTAI HAOFENG TRADE CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Bei da jie 53hao 1609shi; Zhi fu qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":" 86 18354522200","b_email":"[email protected]","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"Unified social credit code":["91370602073035263P"],"Trade register number":["370602200112047"],"NOC":["073035263"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
{"_index":"core-bvd-dmc","_type":"_doc","_id":"97871f8842398794e380a748f5b82ea5","_score":1,"_source":{"a_id":"P305888975","a_id_type":"Contact ID","a_name":"Mr Hengchao Jiang","a_name_normal":"MR HENGCHAO JIANG","a_job_title":"Legal representative","relationship":"Currently works for (Legal representative)","b_id":"CN9390053357","b_id_type":"BVD ID","b_name":"Yantai ji hong educate request information co., ltd.","b_name_normal":"YANTAI JI HONG EDUCATE REQUEST INFORMATION CO","b_country_code":"CN","b_country":"China","b_in_compliance_db":false,"b_nationality":"CN","b_street_address":"Ying chun da jie 131hao nei 1hao; Lai shan qu","b_city":"Yantai","b_postcode":"264000","b_region":"East China|Shandong","b_phone":" 86 18694982900","b_email":"[email protected]","b_latitude":37.511873,"b_longitude":121.396883,"b_geo_accuracy":"Community","b_national_ids":{"NOC":["597807789"],"Trade register number":["370613200023836"],"Unified social credit code":["913706135978077898"]},"dates":{"date_of_birth":null},"file_name":"/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json","b_geo_point":{"lat":37.511873,"lon":121.396883}}}
I also tried reading in chunks, with chunksize=1:
import pandas as pd

add_header = True
data = pd.read_json('G:\\1.json',
                    encoding='utf8', lines=True, chunksize=1)
for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([subdf[['_index', '_id', '_score']], pd.json_normalize(subdf._source)], axis=1)
    temp_df.to_csv('G:\\1.csv', index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False
Output: I converted 2 lines of the JSON file. The first line is converted correctly, but the second line is wrong: the values for the keys '_index', '_id', '_score' and the values from '_source' end up on separate rows.
_index,_id,_score,a_id,a_id_type,a_name,a_name_normal,a_job_title,relationship,b_id,b_id_type,b_name,b_name_normal,b_country_code,b_country,b_in_compliance_db,b_nationality,b_street_address,b_city,b_postcode,b_region,b_phone,b_email,b_latitude,b_longitude,b_geo_accuracy,file_name,b_national_ids.Unified social credit code,b_national_ids.Trade register number,b_national_ids.NOC,dates.date_of_birth,b_geo_point.lat,b_geo_point.lon
core-bvd-dmc,e22762d5c4b81fbcad62b5c1d77226ec,1,P305906272,Contact ID,Mr Chuanzong Chen,MR CHUANZONG CHEN,Executive director and general manager,Currently works for (Executive director and general manager),CN9390051924,BVD ID,"Yantai haofeng trade co., ltd.",YANTAI HAOFENG TRADE CO,CN,China,False,CN,Bei da jie 53hao 1609shi; Zhi fu qu,Yantai,264000,East China|Shandong, 86 18354522200,[email protected],37.511873,121.396883,Community,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,['91370602073035263P'],['370602200112047'],['073035263'],,37.511873,121.396883
,,,P305888975,Contact ID,Mr Hengchao Jiang,MR HENGCHAO JIANG,Legal representative,Currently works for (Legal representative),CN9390053357,BVD ID,"Yantai ji hong educate request information co., ltd.",YANTAI JI HONG EDUCATE REQUEST INFORMATION CO,CN,China,False,CN,Ying chun da jie 131hao nei 1hao; Lai shan qu,Yantai,264000,East China|Shandong, 86 18694982900,[email protected],37.511873,121.396883,Community,/media/hedwig/iforce/data/BvD/s3-transfer/SuperTable_v3_json/dmc/part-00020-7b09c546-2adc-413e-9e68-18b300e205cf-c000.json,['597807789'],['370613200023836'],['913706135978077898'],,37.511873,121.396883
core-bvd-dmc,97871f8842398794e380a748f5b82ea5,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
CodePudding user response:
chunksize returns an iterator of DataFrames, so you cannot call strip or json.loads on that. You would probably need something like:
for subdf in data:
    # subdf is already a dataframe.
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    temp_df.to_csv(filename, index=None, mode='a', encoding='utf8')
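Note that to_csv writes the header on every call by default, so when appending you may want to keep the add_header flag from your own code: set add_header = True before the loop and write each chunk with

    temp_df.to_csv(filename, index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False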
You can modify the pd.concat line to flatten/extract the data you want, but I hope you get the idea.
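For example, here is a rough sketch that keeps only a handful of the flattened _source fields (the wanted list below is just an illustration, using field names from your sample):

for subdf in data:
    # flatten the nested _source dicts into columns
    source_df = pd.json_normalize(subdf._source)
    # illustrative subset of columns; replace with whatever you need
    wanted = ['a_id', 'a_name', 'b_id', 'b_name', 'b_country']
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        source_df[wanted]
    ], axis=1)
    temp_df.to_csv(filename, index=None, header=add_header, mode='a', encoding='utf8')
    add_header = False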
Another thought: although CSV holds large data better than JSON, would you consider splitting the output CSV into multiple files instead of creating one huge CSV?
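A rough sketch of that idea, assuming you also raise chunksize again (the chunk size and the 1_part{i} file name pattern are placeholders to tune):

import pandas as pd

data = pd.read_json('G:\\1.json', encoding='utf8', lines=True, chunksize=100000)
for i, subdf in enumerate(data):
    temp_df = pd.concat([
        subdf[['_index', '_id', '_score']].reset_index(drop=True),
        pd.json_normalize(subdf._source)
    ], axis=1)
    # one file per chunk, so no single CSV grows without bound
    temp_df.to_csv(f'G:\\1_part{i}.csv', index=None, encoding='utf8')

Writing one file per chunk also means each to_csv call is one large write instead of many tiny appends, which should help with the speed problem you started with.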