Home > database >  Flatten a column with nested dictionary in pandas
Flatten a column with nested dictionary in pandas

Time:02-01

I fetch data from MySQL database. The fetched data has a column with nested lists and dictionaries. This json is similar to the stored data in my database:

my_dict = {'id': [1, 2, 3],
           'b': [{'100': [{'p': 10, 'max': 20, 'min': 15},
                          {'p': 20, 'max': 30, 'min': 20}]
                 },
                 {'101': [{'p': 100, 'max': 200, 'min': 150}],
                  '102': [{'p': 105, 'max': 205, 'min': 155},
                          {'p': 102, 'max': 202, 'min': 152}]},
                 {'103': [{'p': 210, 'max': 2110, 'min': 1115}]}]}

and in code, I have df only:

df = pd.DataFrame(my_dict)

df   
      id                             b
   0  1   {'100': [{'p': 10, 'max': 20, 'min': 15}, {'p': 20, 'max': 30, 'min': 20}]}
   1  2   {'101': [{'p': 100, 'max': 200, 'min': 150}], '102': [{'p': 105, 'max': 205, 'min': 155}, {'p': 102, 'max': 202, 'min': 152}]}
   2  3      {'103': [{'p': 210, 'max': 2110, 'min': 1115}]}

Now, I want to flat the column b like the following:

df
      id   key     p     max   min
  0    1    100    10    20    15
  1    1    100    20    30    20
  2    2    101    100   200   150
  3    2    102    105   205   155
  4    2    102    102   202   152
  5    3    103    210   2120  1115

I read about the explode and pd.json_normalize. But they did not help. What is the most efficient solution for this problem?

CodePudding user response:

Use a nested list comprehension:

pd.DataFrame([{'id': i, 'key': k, 'p': i, **x}
              for i, d in zip(my_dict['id'], my_dict['b'])
              for k, l in d.items() for x in l])

Output:

   id  key    p   max   min
0   1  100   10    20    15
1   1  100   20    30    20
2   2  101  100   200   150
3   2  102  105   205   155
4   2  102  102   202   152
5   3  103  210  2110  1115

CodePudding user response:

You can pre-process the dictionary then explode and json_normalize the dataframe

my_dict['b'] = [[d_ | {'key': k}  for k, lst in d.items() for d_ in lst]
                for d in my_dict['b']]

df = (pd.DataFrame(my_dict).explode('b', ignore_index=True)
      .pipe(lambda d: d.join(pd.json_normalize(d.pop('b')))))
pprint(my_dict) # from pprint import pprint

{'b': [[{'key': '100', 'max': 20, 'min': 15, 'p': 10},
        {'key': '100', 'max': 30, 'min': 20, 'p': 20}],
       [{'key': '101', 'max': 200, 'min': 150, 'p': 100},
        {'key': '102', 'max': 205, 'min': 155, 'p': 105},
        {'key': '102', 'max': 202, 'min': 152, 'p': 102}],
       [{'key': '103', 'max': 2110, 'min': 1115, 'p': 210}]],
 'id': [1, 2, 3]}

print(pd.DataFrame(my_dict))

   id                                                  b
0   1  [{'p': 10, 'max': 20, 'min': 15, 'key': '100'}...
1   2  [{'p': 100, 'max': 200, 'min': 150, 'key': '10...
2   3  [{'p': 210, 'max': 2110, 'min': 1115, 'key': '...

print(df)

   id    p   max   min  key
0   1   10    20    15  100
1   1   20    30    20  100
2   2  100   200   150  101
3   2  105   205   155  102
4   2  102   202   152  102
5   3  210  2110  1115  103
  • Related