Nested dictionary with key: list[key:value] pairs to dataframe-CodePudding

I'm currently struggling with creating a dataframe based on a dictionary that is nested like {key1:[{key:value},{key:value}, ...],key2:[{key:value},{key:value},...]} And I want this to go into a dataframe, where the value of key1 and key2 are the index, while the list nested key:value pairs would become the column and record values.

Now, for each key1, key2, etc the list key:value pairs can be different in size. Example data:

some_dict = {'0000297386FB11E2A2730050568F1BAB': [{'FILE_ID': '0000297386FB11E2A2730050568F1BAB'},
  {'FileTime': '1362642335'},
  {'Size': '1016439'},
  {'DocType_Code': 'AF3BD580734A77068DD083389AD7FDAF'},
  {'Filenr': 'F682B798EC9481FF031C4C12865AEB9A'},
  {'DateRegistered': 'FAC4F7F9C3217645C518D5AE473DCB1E'},
  {'TITLE': '2096158F036B0F8ACF6F766A9B61A58B'}],
 '000031EA51DA11E397D30050568F1BAB': [{'FILE_ID': '000031EA51DA11E397D30050568F1BAB'},
  {'FileTime': '1384948248'},
  {'Size': '873514'},
  {'DatePosted': '7C6BCB90AC45DA1ED6D1C376FC300E7B'},
  {'DocType_Code': '28F404E9F3C394518AF2FD6A043D3A81'},
  {'Filenr': '13A6A062672A88DE75C4D35917F3C415'},
  {'DateRegistered': '8DD4262899F20DE45F09F22B3107B026'},
  {'Comment': 'AE207D73C9DDB76E1EEAA9241VJGN02'},
  {'TITLE': 'DF96336A6FE08E34C5A94F6A828B4B62'}]}

The final result should look like this:

Index | File_ID | ... | DatePosted | ... | Comment | Title
0000297386FB11E2A2730050568F1BAB|0000297386FB11E2A2730050568F1BAB|...|NaN|...|NaN|2096158F036B0F8ACF6F766A9B61A58B
000031EA51DA11E397D30050568F1BAB|000031EA51DA11E397D30050568F1BAB|...|7C6BCB90AC45DA1ED6D1C376FC300E7B|...|AE207D73C9DDB76E1EEAA9241VJGN02|DF96336A6FE08E34C5A94F6A828B4B62

Now I've tried to parse the dict directly to pandas using comprehension as suggested in Creating dataframe from a dictionary where entries have different lengths and tried to flatten the dict more, and then parsing it to pandas Flatten nested dictionaries, compressing keys. Both with no avail.

CodePudding user response：

Here you go.

You do not need key of first dict. Because it's also available in lower stages. Then you need to merge multiple dicts into single one. I did that with update. THen we turn dict into pd series. And concat it into a dataframe.

In [39]: seriess = []
    ...: for values in some_dict.values():
    ...:     d = {}
    ...:     for thing in values:
    ...:         d.update(thing)
    ...:     s = pd.Series(d)
    ...:     seriess.append(s)
    ...:

In [40]: pd.concat(seriess,axis=1).T
Out[40]:
                            FILE_ID    FileTime     Size  ...                             TITLE                        DatePosted                          Comment
0  0000297386FB11E2A2730050568F1BAB  1362642335  1016439  ...  2096158F036B0F8ACF6F766A9B61A58B                               NaN                              NaN
1  000031EA51DA11E397D30050568F1BAB  1384948248   873514  ...  DF96336A6FE08E34C5A94F6A828B4B62  7C6BCB90AC45DA1ED6D1C376FC300E7B  AE207D73C9DDB76E1EEAA9241VJGN02

CodePudding user response：

Let's try the following code:

dfs = []
for k in some_dict.keys():
    dfs.append(pd.DataFrame.from_records(some_dict[k]))
    
new_df = [dfs[0].append(x) for x in dfs[1:]][0]

final_result = (new_df
                .groupby(new_df['FILE_ID'].notna().cumsum())
                .first())

Output

    FILE_ID FileTime    Size    DocType_Code    Filenr  DateRegistered  TITLE   DatePosted  Comment
FILE_ID                                 
1   0000297386FB11E2A2730050568F1BAB    1362642335  1016439 AF3BD580734A77068DD083389AD7FDAF    F682B798EC9481FF031C4C12865AEB9A    FAC4F7F9C3217645C518D5AE473DCB1E    2096158F036B0F8ACF6F766A9B61A58B    None    None
2   000031EA51DA11E397D30050568F1BAB    1384948248  873514  28F404E9F3C394518AF2FD6A043D3A81    13A6A062672A88DE75C4D35917F3C415    8DD4262899F20DE45F09F22B3107B026    DF96336A6FE08E34C5A94F6A828B4B62    7C6BCB90AC45DA1ED6D1C376FC300E7B    AE207D73C9DDB76E1EEAA9241VJGN02