I'm currently struggling with creating a dataframe based on a dictionary that is nested like {key1:[{key:value},{key:value}, ...],key2:[{key:value},{key:value},...]}
And I want this to go into a dataframe, where the value of key1 and key2
are the index, while the list
nested key:value
pairs would become the column and record
values.
Now, for each key1, key2, etc
the list key:value pairs can be different in size. Example data:
some_dict = {'0000297386FB11E2A2730050568F1BAB': [{'FILE_ID': '0000297386FB11E2A2730050568F1BAB'},
{'FileTime': '1362642335'},
{'Size': '1016439'},
{'DocType_Code': 'AF3BD580734A77068DD083389AD7FDAF'},
{'Filenr': 'F682B798EC9481FF031C4C12865AEB9A'},
{'DateRegistered': 'FAC4F7F9C3217645C518D5AE473DCB1E'},
{'TITLE': '2096158F036B0F8ACF6F766A9B61A58B'}],
'000031EA51DA11E397D30050568F1BAB': [{'FILE_ID': '000031EA51DA11E397D30050568F1BAB'},
{'FileTime': '1384948248'},
{'Size': '873514'},
{'DatePosted': '7C6BCB90AC45DA1ED6D1C376FC300E7B'},
{'DocType_Code': '28F404E9F3C394518AF2FD6A043D3A81'},
{'Filenr': '13A6A062672A88DE75C4D35917F3C415'},
{'DateRegistered': '8DD4262899F20DE45F09F22B3107B026'},
{'Comment': 'AE207D73C9DDB76E1EEAA9241VJGN02'},
{'TITLE': 'DF96336A6FE08E34C5A94F6A828B4B62'}]}
The final result should look like this:
Index | File_ID | ... | DatePosted | ... | Comment | Title
0000297386FB11E2A2730050568F1BAB|0000297386FB11E2A2730050568F1BAB|...|NaN|...|NaN|2096158F036B0F8ACF6F766A9B61A58B
000031EA51DA11E397D30050568F1BAB|000031EA51DA11E397D30050568F1BAB|...|7C6BCB90AC45DA1ED6D1C376FC300E7B|...|AE207D73C9DDB76E1EEAA9241VJGN02|DF96336A6FE08E34C5A94F6A828B4B62
Now I've tried to parse the dict directly to pandas using comprehension as suggested in Creating dataframe from a dictionary where entries have different lengths and tried to flatten the dict more, and then parsing it to pandas Flatten nested dictionaries, compressing keys. Both with no avail.
CodePudding user response:
Here you go.
You do not need key of first dict. Because it's also available in lower stages. Then you need to merge multiple dicts into single one. I did that with update. THen we turn dict into pd series. And concat it into a dataframe.
In [39]: seriess = []
...: for values in some_dict.values():
...: d = {}
...: for thing in values:
...: d.update(thing)
...: s = pd.Series(d)
...: seriess.append(s)
...:
In [40]: pd.concat(seriess,axis=1).T
Out[40]:
FILE_ID FileTime Size ... TITLE DatePosted Comment
0 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 ... 2096158F036B0F8ACF6F766A9B61A58B NaN NaN
1 000031EA51DA11E397D30050568F1BAB 1384948248 873514 ... DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02
CodePudding user response:
Let's try the following code:
dfs = []
for k in some_dict.keys():
dfs.append(pd.DataFrame.from_records(some_dict[k]))
new_df = [dfs[0].append(x) for x in dfs[1:]][0]
final_result = (new_df
.groupby(new_df['FILE_ID'].notna().cumsum())
.first())
Output
FILE_ID FileTime Size DocType_Code Filenr DateRegistered TITLE DatePosted Comment
FILE_ID
1 0000297386FB11E2A2730050568F1BAB 1362642335 1016439 AF3BD580734A77068DD083389AD7FDAF F682B798EC9481FF031C4C12865AEB9A FAC4F7F9C3217645C518D5AE473DCB1E 2096158F036B0F8ACF6F766A9B61A58B None None
2 000031EA51DA11E397D30050568F1BAB 1384948248 873514 28F404E9F3C394518AF2FD6A043D3A81 13A6A062672A88DE75C4D35917F3C415 8DD4262899F20DE45F09F22B3107B026 DF96336A6FE08E34C5A94F6A828B4B62 7C6BCB90AC45DA1ED6D1C376FC300E7B AE207D73C9DDB76E1EEAA9241VJGN02