python I got a complex multiple nested JSON file, how to convert to csv file-CodePudding

Here is my code, It can only convert part of the JSON file, it fails to flatten all JSON，Unable to convert all files

import pandas as pd
import json
all_data = []
add_header = True
with open('C:\\Users\\jeri\\Desktop\\1.json',encoding='utf-8') as f_json:
    for line in f_json:
        line = line.strip()
        if line:
            all_data.append(json.loads(line))
df = pd.json_normalize(all_data)
df.to_csv('C:\\Users\\jeri\\Desktop\\11.csv', index=False,encoding='utf-8',header=add_header)
add_header = False

my json file


{"id":"aa","sex":"male","name":[{"Fn":"jeri","Ln":"teri"}],"age":45,"info":[{"address":{"State":"NY","City":"new york"},"start_date":"2001-09","title":{"name":"Doctor","Exp":"head"},"year":"2001","month":"05"}],"other":null,"Hobby":[{"smoking":null,"gamble":null}],"connect":[{"phone":"123456789","email":"[email protected]"}],"Education":"MBA","School":{"State":"NY","City":"new york"}}
{"id":"aa","sex":"female","name":[{"Fn":"lo","Ln":"li"}],"age":34,"info":[{"address":{"State":"NY","City":"new york"},"start_date":"2008-11","title":{"name":"Doctor","Exp":"hand"},"year":"2008","month":"02"}],"other":null,"Hobby":[{"smoking":null,"gamble":null}],"connect":[{"phone":"123456789","email":"[email protected]"}],"Education":"MBA","School":{"State":"NY","City":"new york"}}

The result of the conversion is below,Not all json files are converted，this is not what i want，I need to flatten and convert all files,

id,sex,name,age,info,other,Hobby,connect,Education,School.State,School.City
aa,male,"[{'Fn': 'jeri', 'Ln': 'teri'}]",45,"[{'address': {'State': 'NY', 'City': 'new york'}, 'start_date': '2001-09', 'title': {'name': 'Doctor', 'Exp': 'head'}, 'year': '2001', 'month': '05'}]",,"[{'smoking': None, 'gamble': None}]","[{'phone': '123456789', 'email': '[email protected]'}]",MBA,NY,new york
aa,female,"[{'Fn': 'lo', 'Ln': 'li'}]",34,"[{'address': {'State': 'NY', 'City': 'new york'}, 'start_date': '2008-11', 'title': {'name': 'Doctor', 'Exp': 'hand'}, 'year': '2008', 'month': '02'}]",,"[{'smoking': None, 'gamble': None}]","[{'phone': '123456789', 'email': '[email protected]'}]",MBA,NY,new york

new code

import pandas as pd
import json
data = []
add_header = True
with open('C:\\Users\\jeri\\Desktop\\1.json',encoding='utf-8') as f_json:
    for line in f_json:
        line = line.strip()
        if line:
            data.append(json.loads(line))
df = pd.json_normalize(data)
dfe = explode('name').explode('info').explode("Hobby"),pd.concat([df, pd.json_normalize(df.name),
                                                                  pd.json_normalize(df.info),
                                                                  pd.json_normalize(df.Hobby)], axis=1)
dfe.to_csv('C:\\Users\\jeri\\Desktop\\11.csv', index=False,encoding='utf-8',header=add_header)
add_header = False

output

id,sex,age,other,Education,School,Fn,Ln,start_date,year,month,address.State,address.City,title.name,title.Exp,phone,email,smoking,gamble
aa,male,45,,MBA,"{'State': 'NY', 'City': 'new york'}",jeri,teri,2001-09,2001,05,NY,new york,Doctor,head,123456789,[email protected],,
aa,female,34,,MBA,"{'State': 'NY', 'City': 'new york'}",lo,li,2008-11,2008,02,NY,new york,Doctor,hand,123456789,[email protected],,

CodePudding user response：

pd.json_normalize flattens the dictionary to columns. When you have list, you need to explode the list that transform the list into rows. If you have dictionary inside the list, you need to apply the json_normalize again on the exploded column.

import pandas as pd

data = []
# read_json normalizes the top layer.
df_it = pd.read_json('C:\\Users\\jeri\\Desktop\\1.json', encoding='utf-8', lines=True, chunksize=100000)
for sub in df_it:
    data.append(sub)

df = pd.concat(data)

# You just need to flatten the list of dict additionally.
# If you are using pandas >= 1.3.0, df = df.explode(['name', 'info', 'Hobby'])
# Otherwise, explode only takes 1 column at a time.
df = df.explode('name').explode('info').explode("Hobby")
df = pd.concat([df.reset_index(drop=True), 
                pd.json_normalize(df.name), 
                # info is reserved keyword (dataframe function name, you cannot use df.info), 
                # change it to accessing by df['info']
                pd.json_normalize(df['info']), 
                pd.json_normalize(df.Hobby)], axis=1)
# Drop the exploded column, the contents inside the column is extracted to columns already
df = df.drop(['name', 'info', 'Hobby'], axis=1)

df.to_csv('C:\\Users\\jeri\\Desktop\\11.csv', index=False, encoding='utf-8')

Another thing to consider is when you have multiple json_normalize, you need to make sure that key in each dict column is not shared amongst the other columns you try to json_normalize. In your sample, I didn't see any of same key name, so I did concat and json_normalize in 1 shot.

However, if you have same key name in dict, you need to take step by step. Do json_normalize on 1 exploded column and rename column(s), then do another json_normalize on another exploded column.