Home > Software design >  How access data and handle missing data in a dictionaries within a pandas dataframe?
How access data and handle missing data in a dictionaries within a pandas dataframe?

Time:08-05

Given, df:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Col1": [1, 2, 3],
        "Person": [
            {
                "ID": 10001,
                "Data": {
                    "Address": {
                        "Street": "1234 Street A",
                        "City": "Houston",
                        "State": "Texas",
                        "Zip": "77002",
                    }
                },
                "Age": 30,
                "Income": 50000,
            },
            {
                "ID": 10002,
                "Data": {
                    "Address": {
                        "Street": "7892 Street A",
                        "City": "Greenville",
                        "State": "Maine",
                        "Zip": np.nan,
                    }
                },
                "Age": np.nan,
                "Income": 63000,
            },
            {"ID": 10003, "Data": {"Address": np.nan}, "Age": 56, "Income": 85000},
        ],
    },
)

Input Dataframe:

   Col1                                             Person
0     1  {'ID': 10001, 'Data': {'Address': {'Street': '...
1     2  {'ID': 10002, 'Data': {'Address': {'Street': '...
2     3  {'ID': 10003, 'Data': {'Address': nan}, 'Age':...

My expected output dataframe is df[['Col1', 'Income', 'Age', 'Street', 'Zip']] where Income, Age, Street, and Zip come from within Person:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    nan
2     3   85000  56.0            NaN    nan

CodePudding user response:

Using list comprehension, we can create most of these columns.

df['Income'] = [x.get('Income') for x in df['Person']]
df['Age'] = [x.get('Age') for x in df['Person']]
df['Age']

Output:

0    30.0
1     NaN
2    56.0
Name: Age, dtype: float64

However, dealing with np.nan values inside a nested dictionary is a real pain. Let's look at getting data from a nested dictionary data where one of the values is nan.

df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]

We get an AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-80-cc2f92bfe95d> in <module>
      1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
      2 
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
      4 
      5 #We get and AttributeError because NoneType object has no get method

<ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
      1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
      2 
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
      4 
      5 #We get and AttributeError because NoneType object has no get method

AttributeError: 'float' object has no attribute 'get'

Let's use the .str accessor with dictionary keys to fetch this data.
There is little documentation in pandas that shows how you can use .str.get or .str[] to fetch values from dictionary objects in a dataframe column/pandas series.

df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']

Output:

0    1234 Street A
1    7892 Street A
2              NaN
Name: Street, dtype: object

And, likewise with

df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']

Leaving use with the columns to build the desired dataframe df[['Col1', 'Income', 'Age', 'Street', 'Zip']] from dictionaries.

Output:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    NaN
2     3   85000  56.0            NaN    NaN

CodePudding user response:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Person": [
        {
            "ID": 10001,
            "Data": {
                "Address": {
                    "Street": "1234 Street A",
                    "City": "Houston",
                    "State": "Texas",
                    "Zip": "77002",
                }
            },
            "Age": 30,
            "Income": 50000,
        },
        {
            "ID": 10002,
            "Data": {
                "Address": {
                    "Street": "7892 Street A",
                    "Zip": np.nan,
                    "City": "Greenville",
                    "State": "Maine",
                }
            },
            "Age": np.nan,
            "Income": 63000,
        },
        {
            "ID": 10003,
            "Data": {"Address": np.nan},
            "Age": 56, "Income": 85000
        },
    ],
})


row_dic_list = df.to_dict(orient='records') # convert to dict 

# remain = ['Col1', 'Income', 'Age', 'Street', 'Zip']  
new_row_dict_list = []
# Iterate over each row to generate new data
for row_dic in row_dic_list: 
    col1 = row_dic['Col1']
    person_dict = row_dic['Person']

    age = person_dict['Age']
    income = person_dict['Income']
    address = person_dict["Data"]["Address"]
    street = np.nan
    zip_v = np.nan
    if isinstance(address, dict):
        street = address["Street"]
        zip_v = address["Zip"]
    new_row_dict = {
        'Col1': col1,
        'Income': income,
        'Age': age,
        'Street': street,
        'Zip': zip_v,
    }
    new_row_dict_list.append(new_row_dict)
# Generate a dataframe from each new row of data
new_df = pd.DataFrame(new_row_dict_list)
print(new_df)

"""
   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    NaN
2     3   85000  56.0            NaN    NaN

"""

  • Related