Here is my code, It can only convert part of the JSON file, it fails to flatten all JSON,Unable to convert all files
import pandas as pd
import json
all_data = []
add_header = True
with open('C:\\Users\\jeri\\Desktop\\1.json',encoding='utf-8') as f_json:
for line in f_json:
line = line.strip()
if line:
all_data.append(json.loads(line))
df = pd.json_normalize(all_data)
df.to_csv('C:\\Users\\jeri\\Desktop\\11.csv', index=False,encoding='utf-8',header=add_header)
add_header = False
my json file
{"id":"aa","sex":"male","name":[{"Fn":"jeri","Ln":"teri"}],"age":45,"info":[{"address":{"State":"NY","City":"new york"},"start_date":"2001-09","title":{"name":"Doctor","Exp":"head"},"year":"2001","month":"05"}],"other":null,"Hobby":[{"smoking":null,"gamble":null}],"connect":[{"phone":"123456789","email":"[email protected]"}],"Education":"MBA","School":{"State":"NY","City":"new york"}}
{"id":"aa","sex":"female","name":[{"Fn":"lo","Ln":"li"}],"age":34,"info":[{"address":{"State":"NY","City":"new york"},"start_date":"2008-11","title":{"name":"Doctor","Exp":"hand"},"year":"2008","month":"02"}],"other":null,"Hobby":[{"smoking":null,"gamble":null}],"connect":[{"phone":"123456789","email":"[email protected]"}],"Education":"MBA","School":{"State":"NY","City":"new york"}}
The result of the conversion is below,Not all json files are converted,this is not what i want,I need to flatten and convert all files,
id,sex,name,age,info,other,Hobby,connect,Education,School.State,School.City
aa,male,"[{'Fn': 'jeri', 'Ln': 'teri'}]",45,"[{'address': {'State': 'NY', 'City': 'new york'}, 'start_date': '2001-09', 'title': {'name': 'Doctor', 'Exp': 'head'}, 'year': '2001', 'month': '05'}]",,"[{'smoking': None, 'gamble': None}]","[{'phone': '123456789', 'email': '[email protected]'}]",MBA,NY,new york
aa,female,"[{'Fn': 'lo', 'Ln': 'li'}]",34,"[{'address': {'State': 'NY', 'City': 'new york'}, 'start_date': '2008-11', 'title': {'name': 'Doctor', 'Exp': 'hand'}, 'year': '2008', 'month': '02'}]",,"[{'smoking': None, 'gamble': None}]","[{'phone': '123456789', 'email': '[email protected]'}]",MBA,NY,new york
new code
import pandas as pd
import json
data = []
add_header = True
with open('C:\\Users\\jeri\\Desktop\\1.json',encoding='utf-8') as f_json:
for line in f_json:
line = line.strip()
if line:
data.append(json.loads(line))
df = pd.json_normalize(data)
dfe = explode('name').explode('info').explode("Hobby"),pd.concat([df, pd.json_normalize(df.name),
pd.json_normalize(df.info),
pd.json_normalize(df.Hobby)], axis=1)
dfe.to_csv('C:\\Users\\jeri\\Desktop\\11.csv', index=False,encoding='utf-8',header=add_header)
add_header = False
output
id,sex,age,other,Education,School,Fn,Ln,start_date,year,month,address.State,address.City,title.name,title.Exp,phone,email,smoking,gamble
aa,male,45,,MBA,"{'State': 'NY', 'City': 'new york'}",jeri,teri,2001-09,2001,05,NY,new york,Doctor,head,123456789,[email protected],,
aa,female,34,,MBA,"{'State': 'NY', 'City': 'new york'}",lo,li,2008-11,2008,02,NY,new york,Doctor,hand,123456789,[email protected],,
CodePudding user response:
pd.json_normalize flattens the dictionary to columns. When you have list
, you need to explode the list that transform the list into rows.
If you have dictionary inside the list, you need to apply the json_normalize
again on the exploded column.
import pandas as pd
data = []
# read_json normalizes the top layer.
df_it = pd.read_json('C:\\Users\\jeri\\Desktop\\1.json', encoding='utf-8', lines=True, chunksize=100000)
for sub in df_it:
data.append(sub)
df = pd.concat(data)
# You just need to flatten the list of dict additionally.
# If you are using pandas >= 1.3.0, df = df.explode(['name', 'info', 'Hobby'])
# Otherwise, explode only takes 1 column at a time.
df = df.explode('name').explode('info').explode("Hobby")
df = pd.concat([df.reset_index(drop=True),
pd.json_normalize(df.name),
# info is reserved keyword (dataframe function name, you cannot use df.info),
# change it to accessing by df['info']
pd.json_normalize(df['info']),
pd.json_normalize(df.Hobby)], axis=1)
# Drop the exploded column, the contents inside the column is extracted to columns already
df = df.drop(['name', 'info', 'Hobby'], axis=1)
df.to_csv('C:\\Users\\jeri\\Desktop\\11.csv', index=False, encoding='utf-8')
Another thing to consider is when you have multiple json_normalize
, you need to make sure that key in each dict column is not shared amongst the other columns you try to json_normalize
. In your sample, I didn't see any of same key name, so I did concat
and json_normalize
in 1 shot.
However, if you have same key name in dict, you need to take step by step. Do json_normalize
on 1 exploded column and rename column(s), then do another json_normalize
on another exploded column.