How to merge similar columns into a single dictionary column in pandas-CodePudding

I am trying to convert a dataframe that has similar naming convention into a single json format column.

Sample Data:

import pandas as pd

df = pd.DataFrame({'id' : 1,
                  'userName' : 'john',
                  'productlist0.name' : 'shoe',
                  'productlist0.price' : 45.89,
                  'productlist0.brand' : 'nike',
                  'productlist1.name' : 'jeans',
                  'productlist1.price' : 19.45,
                  'productlist1.brand' : 'howes',
                  'productlist2.name' : 'watch',
                  'productlist2.price' : 60.0,
                  'productlist2.brand' : 'fossil'
                  }, index = [0])

So we have bunch of columns starting with productlist and that share the column names after the period.

I need to covert these columns into below json format:

df1 = pd.DataFrame(
                  {'id' : 1,
                   'userName' : 'john',
                   'productlist' : '''[{'name' : 'shoe', 'price' : 45.89, 'brand' : 'nike'},
                                   {'name' : 'jeans', 'price' : 19.45, 'brand' : 'howes'},
                                   {'name' : 'watch', 'price' : 60.0, 'brand' : 'fossil'}]'''
                   }, index = [0]
                    )

df1
id  userName    productlist
0   1   john    [{'name' : 'shoe', 'price' : 45.89, 'brand' : 'nike'}, {'name' : 'jeans', 'price' : 19.45, 'brand' : 'howes'}, {'name' : 'watch', 'price' : 60.0, 'brand' : 'fossil'}]

I tried to use the stack approach and got here:

df.filter(regex = '^productlist').rename(columns = lambda x : re.sub(r'productlist\d\.', '', x)).stack().reset_index().\
            groupby(['level_0', 'level_1'])[0].agg(dict)

level_0  level_1
0        brand      {2: 'nike', 5: 'howes', 8: 'fossil'}
         name        {0: 'shoe', 3: 'jeans', 6: 'watch'}
         price             {1: 45.89, 4: 19.45, 7: 60.0}
Name: 0, dtype: object

But I am not sure how to proceed from here. Could someone please help me on this.

CodePudding user response：

You can do it like:

df1 = df.filter(regex="^productlist\d .").T
df1.index = pd.MultiIndex.from_tuples([(a[0], a[1]) for a in df1.index.str.split(".")])
product_values = df1.unstack().droplevel(0, axis=1).to_dict("records")
df1 = pd.concat(
    [
        df[["id", "userName"]],
        pd.DataFrame({"productlist": [product_values]}, index=[0]),
    ],
    axis=1,
)

First get productlist columns. Then reindex with (productlist, key) format and unstack and get a dict in the form of "records" out of it. That will be your list of dicts. Use that to create a new dataframe with column "productlist" and then concat it to original df without productlist\d . columns

print(df1):

   id userName                                        productlist
0   1     john  [{'brand': 'nike', 'name': 'shoe', 'price': 45...