Converting JSON Dictionary to Dataframe-CodePudding

I am relatively new to JSON and after going through several tutorials I have decided to reach out here for additional advice. I am trying to convert the following JSON dictionary via Jupyternotebook/Pandas into a dataframe and later into a csv file (data is extracted from a public API).

Example JSON dictionary without indentation:

json_search_result = 
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

With indentation for better visibility:

json_search_result = 
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'},
 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 
'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': 
[{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': 
'/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 
'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'},
 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 
'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

I have tried among others json_normalize but I either get errors given this is a dictionary:

df2 =  pd.json_normalize(json_search_result)

Thank you for any advice on this issues!

CodePudding user response：

Please try this if this helps your problem. Good reference for further improvements: Convert list of dictionaries to a pandas DataFrame

json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

pprint(json_search_result)

out = pd.DataFrame.from_dict(json_search_result, orient='index').T
print(out)
print(out.columns)

CodePudding user response：

I assume that you will also want to flatten your dataframe since it contains additional nested information. So start by transforming you json string to a dataframe:

json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
A = pd.DataFrame.from_dict(json_search_result, orient='index').T

which gives:

inactive_count                                   links  \
0              0  {'self': '/company/08892216/officers'}   

                                       etag          kind resigned_count  \
0  1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8  officer-list              0   

  total_results active_count items_per_page  \
0             1            1             35   

                                               items start_index  
0  [{'date_of_birth': {'year': 1964, 'month': 2},...           0

Now, this function will flatten ANY dataframe with nested json strings:


def flatten_nested_json_df(df):
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            exploded.index = df.index
            df = pd.concat([df, exploded], axis=1).drop(columns=[col])
            new_columns.extend(exploded.columns)  # inplace

        for col in list_columns:
            # print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

Applying it:

flatten_nested_json_df(A)

will result in:

  index inactive_count                                      etag  \
0      0              0  1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8   

           kind resigned_count total_results active_count items_per_page  \
0  officer-list              0             1            1             35   

  start_index                  links.self  ...  items.occupation  \
0           0  /company/08892216/officers  ...  Company Director   

  items.date_of_birth.year items.date_of_birth.month  \
0                     1964                         2   

                                    items.links.self  \
0  /company/08892216/appointments/Ao1-pIsFQHy0FVS...   

                    items.links.officer.appointments items.address.premises  \
0  /officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointm...                     19   

   items.address.locality  items.address.country items.address.address_line_1  \
0                  London         United Kingdom                 Norcott Road   

  items.address.postal_code  
0                   N16 7EJ

As you can see, the nested fields have been made to columns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 25 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   index                             1 non-null      int64 
 1   inactive_count                    1 non-null      object
 2   etag                              1 non-null      object
 3   kind                              1 non-null      object
 4   resigned_count                    1 non-null      object
 5   total_results                     1 non-null      object
 6   active_count                      1 non-null      object
 7   items_per_page                    1 non-null      object
 8   start_index                       1 non-null      object
 9   links.self                        1 non-null      object
 10  items.officer_role                1 non-null      object
 11  items.appointed_on                1 non-null      object
 12  items.nationality                 1 non-null      object
 13  items.country_of_residence        1 non-null      object
 14  items.name                        1 non-null      object
 15  items.occupation                  1 non-null      object
 16  items.date_of_birth.year          1 non-null      int64 
 17  items.date_of_birth.month         1 non-null      int64 
 18  items.links.self                  1 non-null      object
 19  items.links.officer.appointments  1 non-null      object
 20  items.address.premises            1 non-null      object
 21  items.address.locality            1 non-null      object
 22  items.address.country             1 non-null      object
 23  items.address.address_line_1      1 non-null      object
 24  items.address.postal_code         1 non-null      object
dtypes: int64(3), object(22)
memory usage: 328.0  bytes
None