Given, df:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Col1": [1, 2, 3],
"Person": [
{
"ID": 10001,
"Data": {
"Address": {
"Street": "1234 Street A",
"City": "Houston",
"State": "Texas",
"Zip": "77002",
}
},
"Age": 30,
"Income": 50000,
},
{
"ID": 10002,
"Data": {
"Address": {
"Street": "7892 Street A",
"City": "Greenville",
"State": "Maine",
"Zip": np.nan,
}
},
"Age": np.nan,
"Income": 63000,
},
{"ID": 10003, "Data": {"Address": np.nan}, "Age": 56, "Income": 85000},
],
},
)
Input Dataframe:
Col1 Person
0 1 {'ID': 10001, 'Data': {'Address': {'Street': '...
1 2 {'ID': 10002, 'Data': {'Address': {'Street': '...
2 3 {'ID': 10003, 'Data': {'Address': nan}, 'Age':...
My expected output dataframe is df[['Col1', 'Income', 'Age', 'Street', 'Zip']]
where Income, Age, Street, and Zip come from within Person:
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A nan
2 3 85000 56.0 NaN nan
CodePudding user response:
Using list comprehension, we can create most of these columns.
df['Income'] = [x.get('Income') for x in df['Person']]
df['Age'] = [x.get('Age') for x in df['Person']]
df['Age']
Output:
0 30.0
1 NaN
2 56.0
Name: Age, dtype: float64
However, dealing with np.nan values inside a nested dictionary is a real pain. Let's look at getting data from a nested dictionary data where one of the values is nan.
df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
We get an AttributeError:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-80-cc2f92bfe95d> in <module>
1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
2
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
4
5 #We get and AttributeError because NoneType object has no get method
<ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
2
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
4
5 #We get and AttributeError because NoneType object has no get method
AttributeError: 'float' object has no attribute 'get'
Let's use the .str
accessor with dictionary keys to fetch this data.
There is little documentation in pandas that shows how you can use .str.get
or .str[]
to fetch values from dictionary objects in a dataframe column/pandas series.
df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']
Output:
0 1234 Street A
1 7892 Street A
2 NaN
Name: Street, dtype: object
And, likewise with
df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']
Leaving use with the columns to build the desired dataframe
df[['Col1', 'Income', 'Age', 'Street', 'Zip']]
from dictionaries.
Output:
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A NaN
2 3 85000 56.0 NaN NaN
CodePudding user response:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"Col1": [1, 2, 3],
"Person": [
{
"ID": 10001,
"Data": {
"Address": {
"Street": "1234 Street A",
"City": "Houston",
"State": "Texas",
"Zip": "77002",
}
},
"Age": 30,
"Income": 50000,
},
{
"ID": 10002,
"Data": {
"Address": {
"Street": "7892 Street A",
"Zip": np.nan,
"City": "Greenville",
"State": "Maine",
}
},
"Age": np.nan,
"Income": 63000,
},
{
"ID": 10003,
"Data": {"Address": np.nan},
"Age": 56, "Income": 85000
},
],
})
row_dic_list = df.to_dict(orient='records') # convert to dict
# remain = ['Col1', 'Income', 'Age', 'Street', 'Zip']
new_row_dict_list = []
# Iterate over each row to generate new data
for row_dic in row_dic_list:
col1 = row_dic['Col1']
person_dict = row_dic['Person']
age = person_dict['Age']
income = person_dict['Income']
address = person_dict["Data"]["Address"]
street = np.nan
zip_v = np.nan
if isinstance(address, dict):
street = address["Street"]
zip_v = address["Zip"]
new_row_dict = {
'Col1': col1,
'Income': income,
'Age': age,
'Street': street,
'Zip': zip_v,
}
new_row_dict_list.append(new_row_dict)
# Generate a dataframe from each new row of data
new_df = pd.DataFrame(new_row_dict_list)
print(new_df)
"""
Col1 Income Age Street Zip
0 1 50000 30.0 1234 Street A 77002
1 2 63000 NaN 7892 Street A NaN
2 3 85000 56.0 NaN NaN
"""