I am relatively new to JSON and after going through several tutorials I have decided to reach out here for additional advice. I am trying to convert the following JSON dictionary via Jupyternotebook/Pandas into a dataframe and later into a csv file (data is extracted from a public API).
Example JSON dictionary without indentation:
json_search_result =
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
With indentation for better visibility:
json_search_result =
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'},
'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list',
'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items':
[{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments':
'/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom',
'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'},
'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England',
'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
I have tried among others json_normalize but I either get errors given this is a dictionary:
df2 = pd.json_normalize(json_search_result)
Thank you for any advice on this issues!
CodePudding user response:
Please try this if this helps your problem. Good reference for further improvements: Convert list of dictionaries to a pandas DataFrame
json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
pprint(json_search_result)
out = pd.DataFrame.from_dict(json_search_result, orient='index').T
print(out)
print(out.columns)
CodePudding user response:
I assume that you will also want to flatten your dataframe since it contains additional nested information. So start by transforming you json string to a dataframe:
json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
A = pd.DataFrame.from_dict(json_search_result, orient='index').T
which gives:
inactive_count links \
0 0 {'self': '/company/08892216/officers'}
etag kind resigned_count \
0 1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8 officer-list 0
total_results active_count items_per_page \
0 1 1 35
items start_index
0 [{'date_of_birth': {'year': 1964, 'month': 2},... 0
Now, this function will flatten ANY dataframe with nested json strings:
def flatten_nested_json_df(df):
df = df.reset_index()
s = (df.applymap(type) == list).all()
list_columns = s[s].index.tolist()
s = (df.applymap(type) == dict).all()
dict_columns = s[s].index.tolist()
while len(list_columns) > 0 or len(dict_columns) > 0:
new_columns = []
for col in dict_columns:
exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
exploded.index = df.index
df = pd.concat([df, exploded], axis=1).drop(columns=[col])
new_columns.extend(exploded.columns) # inplace
for col in list_columns:
# print(f"exploding: {col}")
df = df.drop(columns=[col]).join(df[col].explode().to_frame())
new_columns.append(col)
s = (df[new_columns].applymap(type) == list).all()
list_columns = s[s].index.tolist()
s = (df[new_columns].applymap(type) == dict).all()
dict_columns = s[s].index.tolist()
return df
Applying it:
flatten_nested_json_df(A)
will result in:
index inactive_count etag \
0 0 0 1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8
kind resigned_count total_results active_count items_per_page \
0 officer-list 0 1 1 35
start_index links.self ... items.occupation \
0 0 /company/08892216/officers ... Company Director
items.date_of_birth.year items.date_of_birth.month \
0 1964 2
items.links.self \
0 /company/08892216/appointments/Ao1-pIsFQHy0FVS...
items.links.officer.appointments items.address.premises \
0 /officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointm... 19
items.address.locality items.address.country items.address.address_line_1 \
0 London United Kingdom Norcott Road
items.address.postal_code
0 N16 7EJ
As you can see, the nested fields have been made to columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 1 non-null int64
1 inactive_count 1 non-null object
2 etag 1 non-null object
3 kind 1 non-null object
4 resigned_count 1 non-null object
5 total_results 1 non-null object
6 active_count 1 non-null object
7 items_per_page 1 non-null object
8 start_index 1 non-null object
9 links.self 1 non-null object
10 items.officer_role 1 non-null object
11 items.appointed_on 1 non-null object
12 items.nationality 1 non-null object
13 items.country_of_residence 1 non-null object
14 items.name 1 non-null object
15 items.occupation 1 non-null object
16 items.date_of_birth.year 1 non-null int64
17 items.date_of_birth.month 1 non-null int64
18 items.links.self 1 non-null object
19 items.links.officer.appointments 1 non-null object
20 items.address.premises 1 non-null object
21 items.address.locality 1 non-null object
22 items.address.country 1 non-null object
23 items.address.address_line_1 1 non-null object
24 items.address.postal_code 1 non-null object
dtypes: int64(3), object(22)
memory usage: 328.0 bytes
None