Split list of dictionary column to separate columns, melting the dataframe


I have a table with a column containing lists of dictionaries. First, I need to split each list so that every dictionary ends up in its own column. Next, I need to expand each dictionary's keys into their own columns while melting the dataframe. A breakdown of the steps is given below, followed by a minimal sketch of the intended transformation.

Step 1 - (image in original post)

Step 2 - (image in original post)

Step 3 (Output) - (image in original post)
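A minimal sketch of the intended transformation, using made-up example data (the values are illustrative only):

import pandas as pd

# hypothetical input: one row per id, each 'variant' cell holds a list of {'position', 'price'} dicts
df = pd.DataFrame({'id': [1, 2],
                   'variant': [[{'position': 1, 'price': 100}, {'position': 2, 'price': 500}],
                               [{'position': 1, 'price': 150}]]})

# desired output (Step 3): one row per id/dictionary pair, with the dict keys as columns
#    id  position  price
# 0   1         1    100
# 1   1         2    500
# 2   2         1    150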

CodePudding user response:

Use a nested list comprehension over the lists of dictionaries, then pass the result to the DataFrame constructor:

df = pd.DataFrame({'id':[1,2], 'variant':[[{'position':1, 'price':100}, {'position':2, 'price':500}, 
                                           {'position':3, 'price':300}],
                                         [ {'position':1, 'price':150}, {'position':2, 'price':400}]]})
    
print (df)
   id                                            variant
0   1  [{'position': 1, 'price': 100}, {'position': 2...
1   2  [{'position': 1, 'price': 150}, {'position': 2...

# build one flat dict per (id, dictionary) pair; pandas aligns the keys into columns
L = [{**{'id':x}, **z} for x, y in zip(df['id'], df['variant']) for z in y]

df2 = pd.DataFrame(L)

Or use DataFrame.explode with json_normalize:

df1 = df.explode('variant').reset_index(drop=True)

df2 = df1[['id']].join(pd.json_normalize(df1['variant']))

print (df2)
   id  position  price
0   1         1    100
1   1         2    500
2   1         3    300
3   2         1    150
4   2         2    400
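As a side note (not part of the original answer), the same flattening can be done in one json_normalize call by passing record_path and meta; a sketch assuming the same list-valued df as above:

df2 = pd.json_normalize(df.to_dict('records'), record_path='variant', meta=['id'])

# record fields come first (position, price, id), so reorder if needed
df2 = df2[['id', 'position', 'price']]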

If the solution above raises:

TypeError: 'str' object is not a mapping

it is because the lists are stored as strings, so parse them with ast.literal_eval first:

df = pd.DataFrame({'id':[1,2], 'variant':["[{'position':1, 'price':100}, {'position':2, 'price':500}, {'position':3, 'price':300}]",
                                          "[ {'position':1, 'price':150}, {'position':2, 'price':400}]"]})
    
print (df)
   id                                            variant
0   1  [{'position':1, 'price':100}, {'position':2, '...
1   2  [ {'position':1, 'price':150}, {'position':2, ...
                  

import ast
                                       
# parse each string into a real list of dicts, then flatten as before
L = [{**{'id':x}, **z} for x, y in zip(df['id'], df['variant']) for z in ast.literal_eval(y)]

df2 = pd.DataFrame(L)
print (df2)
   id  position  price
0   1         1    100
1   1         2    500
2   1         3    300
3   2         1    150
4   2         2    400
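If you prefer the explode/json_normalize route for the string case, parse the column first; a sketch assuming the same string-valued df as above (not from the original answer):

import ast

# convert each string into a real list of dicts, then flatten as before
df['variant'] = df['variant'].map(ast.literal_eval)
df1 = df.explode('variant').reset_index(drop=True)
df2 = df1[['id']].join(pd.json_normalize(df1['variant']))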

Performance on 10k rows:

df = pd.DataFrame({'id':[1,2], 'variant':[[{'position':1, 'price':100}, {'position':2, 'price':500}, 
                                           {'position':3, 'price':300}],
                                         [ {'position':1, 'price':150}, {'position':2, 'price':400}]]})
    
df = pd.concat([df] * 5000, ignore_index=True)

#keramat solution
In [23]: %%timeit
    ...: df.explode('variant').apply({'variant':lambda x: pd.Series(x), 'id': lambda x: pd.Series(x)}).droplevel(0, axis = 1).rename(columns={'position':'position', 'price':'price', 0:'id'})
    ...: 
14.2 s ± 505 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %%timeit 
    ...: df1 = df.explode('variant').reset_index(drop=True)
    ...: 
    ...: df1[['id']].join(pd.json_normalize(df1['variant']))
    ...: 
    ...: 
180 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [25]: %%timeit
    ...: pd.DataFrame([{**{'id':x},**z} for x,  y in zip(df['id'], df['variant']) for z in y])
    ...: 
    ...: 
52.3 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

Use:

df.explode('variant').apply({'id': lambda x: x, 'variant':lambda x: pd.Series(x)}).droplevel(0, axis = 1)

Output:

   id  position  price
0   1         1    100
0   1         2    500
0   1         3    300
1   2         1    150
1   2         2    400