I've been fighting to go from a nested dictionary of depth D to a pandas DataFrame.
I've tried with recursive function, like the following one, but my problem is that when I'm iterating over a KEY, I don't know what was the pervious key.
I've also tried with json.normalize, pandas from dict but I always end up with dots in the columns...
Example code:
def iterate_dict(d, i = 2, cols = []):
for k, v in d.items():
# missing here how to check for the previous key
# so that I can create an structure to create the dataframe.
if type(v) is dict:
print('this is k: ', k)
if i % 2 == 0:
cols.append(k)
i =1
iterate_dict(v, i, cols)
else:
print('this is k2: ' , k, ': ', v)
iterate_dict(test2)
This is an example of how my dictionary looks like:
# example 2
test = {
'column-gender': {
'male': {
'column-country' : {
'FRENCH': {
'column-class': [0,1]
},
('SPAIN','ITALY') : {
'column-married' : {
'YES': {
'column-class' : [0,1]
},
'NO' : {
'column-class' : 2
}
}
}
}
},
'female': {
'column-country' : {
('FRENCH', 'SPAIN') : {
'column-class' : [[1,2],'#']
},
'REST-OF-VALUES': {
'column-married' : '*'
}
}
}
}
}
And this is how I want the dataframe to look like:
Any suggestion is welcome :)
CodePudding user response:
I'm not sure how that data going to be consistent but for just understanding we can do something like the below, remember this is just a little demo on the approach of how we can handle it, you can spend more time to polish it up accordingly:
I added comments on each step for better understanding.
import pandas as pd
def nested_dict_to_df(data, columns=None):
if columns are None:
columns = []
# if the data is a dictionary, then we need to iterate over the keys
if isinstance(data, dict):
for key, value in data.items():
columns.append(key)
yield from nested_dict_to_df(value, columns) # recursive call
columns.pop() # remove the last element
else:
yield columns [data]
df = pd.DataFrame(nested_dict_to_df(data))
# Drop column [0, 2, 4, 6] from the dataframe that are not needed for the final output
df = df.drop(df.columns[[0, 2, 4, 6]], axis=1)
header = ["GENDER", "COUNTRY", "CLASS", "MARRIED"] # Desired header
df.columns = header
print(df)
Output:
GENDER COUNTRY CLASS MARRIED
0 male FRENCH [0, 1] None
1 male (SPAIN, ITALY) YES [0, 1]
2 male (SPAIN, ITALY) NO 2
3 female (FRENCH, SPAIN) [[1, 2], #] None
4 female REST-OF-VALUES * None
CodePudding user response:
If the column-keys are consistently prefixed with column-
, you can create a recursive function:
def data_to_df(data):
rec_out = []
columns = []
def dict_to_rec(d, depth=0, curr_row={}):
for k, v in d.items():
if 'column-' in k: # definition of a column
columns.append(k[7:])
if isinstance(v, dict):
for val, nested_dict in v.items():
dict_to_rec(nested_dict, depth 1, dict(curr_row, **{columns[depth]: val}))
else:
rec_out.append(dict(curr_row, **{columns[depth]: v}))
dict_to_rec(data)
return pd.DataFrame(rec_out)
print(data_to_df(test))
Output:
gender country class married
0 male FRENCH [0, 1] NaN
1 male (SPAIN, ITALY) YES [0, 1]
2 male (SPAIN, ITALY) NO 2
3 female (FRENCH, SPAIN) [[1, 2], #] NaN
4 female REST-OF-VALUES * NaN