I have the following nested dictionary:
user_dict = {'user1': {'id1': {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16},
                       'id2': {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}},
             'user2': {'id3': {('word9', 'word10'): 0.59, ('word11', 'word12'): 0.13},
                       'id4': {('word13', 'word14'): 0.41, ('word14', 'word15'): 0.74}}}
For my purpose I would like to convert the nested dictionary into a pandas dataframe of the form:
user  | id  | w1    | w2    | score
------------------------------------
user1 | id1 | word1 | word2 | 0.99
      |     | word3 | word4 | 0.16
      | id2 | word5 | word6 | 0.73
and so on.
I've tried a few ways before, and this is my current solution:
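import pandas as pd

# Only two levels are flattened here, so each value is still the innermost dict
df = pd.Series({(i, j): user_dict[i][j]
                for i in user_dict.keys()
                for j in user_dict[i].keys()}).rename_axis(['user', 'id']).reset_index(name='Col3')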
So the output is:
user  | id  | Col3
-------------------------------------------------------------------
user1 | id1 | {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16}
user1 | id2 | {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}
and so on.
Can someone tell me what I am doing wrong with the last column?
CodePudding user response:
You could use a nested list comprehension/generator:
df = pd.DataFrame(([k0, k1, *k2, d2]
                   for k0, d0 in user_dict.items()
                   for k1, d1 in d0.items()
                   for k2, d2 in d1.items()
                   ), columns=['user', 'id', 'w1', 'w2', 'score'])
Output:
    user   id      w1      w2  score
0  user1  id1   word1   word2   0.99
1  user1  id1   word3   word4   0.16
2  user1  id2   word5   word6   0.73
3  user1  id2   word7   word8   0.69
4  user2  id3   word9  word10   0.59
5  user2  id3  word11  word12   0.13
6  user2  id4  word13  word14   0.41
7  user2  id4  word14  word15   0.74
CodePudding user response:
Alternatively, with fewer loops:
>>> pd.concat({k: pd.DataFrame(v) for k, v in user_dict.items()}).melt(ignore_index=False).dropna()
                    variable  value
user1 word1  word2       id1   0.99
      word3  word4       id1   0.16
      word5  word6       id2   0.73
      word7  word8       id2   0.69
user2 word9  word10      id3   0.59
      word11 word12      id3   0.13
      word13 word14      id4   0.41
      word14 word15      id4   0.74
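This result still carries user, w1 and w2 in the index and uses the default `variable`/`value` column names. A possible follow-up to reach the column layout from the question, assuming a pandas version where `melt` accepts `ignore_index` (1.1 or later) and the same `user_dict` (the name `wide` is only illustrative):

```python
import pandas as pd

user_dict = {'user1': {'id1': {('word1', 'word2'): 0.99, ('word3', 'word4'): 0.16},
                       'id2': {('word5', 'word6'): 0.73, ('word7', 'word8'): 0.69}},
             'user2': {'id3': {('word9', 'word10'): 0.59, ('word11', 'word12'): 0.13},
                       'id4': {('word13', 'word14'): 0.41, ('word14', 'word15'): 0.74}}}

# One frame per user: ids become columns, word pairs become a 2-level index.
# Concatenating with a dict adds the user as the outermost index level.
wide = pd.concat({user: pd.DataFrame(ids) for user, ids in user_dict.items()})

df = (wide.melt(var_name='id', value_name='score', ignore_index=False)
          .dropna(subset=['score'])           # drop id/word-pair combinations that do not exist
          .rename_axis(['user', 'w1', 'w2'])  # name the three index levels
          .reset_index()                      # move them into columns
          .reindex(columns=['user', 'id', 'w1', 'w2', 'score']))
print(df)
```

Note that the intermediate `wide` frame holds a NaN for every id that does not belong to a user, so on large inputs the generator-based answer above may use less memory.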