Home > Back-end >  Flattening List of Dict containing multiple nested lists using pandas json_normalize
Flattening List of Dict containing multiple nested lists using pandas json_normalize

Time:12-14

We have following list -

users_info = [
    {
        "Id": "21",
        "Name": "ABC",
        "Country": {
            "Name": "Country 1"
        },
        "Addresses": {
            "records": [
                {
                    "addressId": "12",
                    "line1": "xyz, 102",
                    "city": "PQR"
                },
                {
                    "addressId": "13",
                    "line1": "YTR, 102",
                    "city": "NMS"
                }
            ]
        },
        "Education": {
            "records": [
                {
                    "Id": "45",
                    "Degree": "Bachelors"
                },
                {
                    "Id": "49",
                    "Degree": "Masters"
                }
            ]
        }
        
    },
    {
        "Id": "26",
        "Name": "PEW",
        "Country": {
            "Name": "Country 2"
        },
        "Addresses": {
            "records": [
                {
                    "addressId": "10",
                    "line1": "BTR, 12",
                    "city": "UYT"
                },
                {
                    "addressId": "123",
                    "line1": "MEQW, 6",
                    "city": "KJH"
                }
            ]
        },
        "Education": {
            "records": [
                {
                    "Id": "45",
                    "Degree": "Bachelors"
                },
                {
                    "Id": "49",
                    "Degree": "Masters"
                }
            ]
        }
        
    },
    {
        "Id": "214",
        "Name": "TUF",
        "Country": None,
        "Addresses": {
            "records": None
        },
        "Education": {
            "records": [
                {
                    "Id": "45",
                    "Degree": "Bachelors"
                },
                {
                    "Id": "49",
                    "Degree": "Masters"
                }
            ]
        }
        
    },
    {
        "Id": "2609",
        "Name": "JJU",
        "Country": {
            "Name": "Country 2"
        },
        "Addresses": {
            "records": [
                {
                    "addressId": "10",
                    "line1": "BTR, 12",
                    "city": "UYT"
                },
                {
                    "addressId": "123",
                    "line1": "MEQW, 6",
                    "city": "KJH"
                }
            ]
        },
        "Education": None       
    }       
]

I want to flatten this list of dictionaries and create pandas dataframe -

Id | Name | Country.Name | Addresses.addressId | Addresses.line1 | Addresses.city | Education.Id | Education.Degree

There are two list of dictionaries - Addresses and Education. There is possibilty that any of these (Country, Addresses and Education) can be None

How do I create dataframe using above data to get the required format?

Here's what I have tried -

dataset_1 = pandas.json_normalize([user for user in users_info],
                                record_path = ["Addresses"],
                                meta = ["Id", "Name", ["Country", "Name"]],
                                errors="ignore"
                               )
dataset_2 = pandas.json_normalize([user for user in users_info],
                                record_path = ["Education"],
                                meta = ["Id", "Name", ["Country", "Name"]],
                                errors="ignore"
                               )
dataset = pandas.concat([dataset_1, dataset_2], axis=1)

When dataset_1 is executing, I am getting -

NoneType object is not iterable error

CodePudding user response:

You can use:

df = pd.json_normalize(users_info)
df_addresses = df['Addresses.records'].explode().apply(pd.Series)
df_addresses.rename(columns={col:f'Addresses.{col}' for col in df_addresses.columns}, inplace=True)

df_education = df['Education.records'].explode().apply(pd.Series)
df_education.rename(columns={col:f'Education.{col}' for col in df_education.columns}, inplace=True)

cols = [col for col in df.columns if col not in ['Addresses.records', 'Education.records']]
df = df[cols].join(df_addresses).join(df_education)
df.dropna(axis=1, how='all', inplace=True)
print(df)

OUTPUT

Id Name Country.Name Addresses.addressId Addresses.line1 Addresses.city Education.Degree Education.Id
0    21  ABC    Country 1                  12        xyz, 102            PQR        Bachelors           45
0    21  ABC    Country 1                  12        xyz, 102            PQR          Masters           49
0    21  ABC    Country 1                  13        YTR, 102            NMS        Bachelors           45
0    21  ABC    Country 1                  13        YTR, 102            NMS          Masters           49
1    26  PEW    Country 2                  10         BTR, 12            UYT        Bachelors           45
1    26  PEW    Country 2                  10         BTR, 12            UYT          Masters           49
1    26  PEW    Country 2                 123         MEQW, 6            KJH        Bachelors           45
1    26  PEW    Country 2                 123         MEQW, 6            KJH          Masters           49
2   214  TUF          NaN                 NaN             NaN            NaN        Bachelors           45
2   214  TUF          NaN                 NaN             NaN            NaN          Masters           49
3  2609  JJU    Country 2                  10         BTR, 12            UYT              NaN          NaN
3  2609  JJU    Country 2                 123         MEQW, 6            KJH              NaN          NaN
  • Related