how to break split dataframe columns of string list of lists into multiple columns-CodePudding

Hi I'm working with a large table where I need to pivot one column into multiple columns but instead I got columns as a list of lists instead. I tried to convert a list of lists into columns however I got errors.

the table should be like this "preferred result"

ID	Date	first order 1	first Order 2	second order1	second order 2
customr1	5-5-2000	Ice Cream	potatoes	Cheese	watermelon
customr2	12-4-2020	Cheese	NA	water	Milk
customr2	12-9-2021	lime	NA	Banana	NA

I've tried these solutions

ID	Date	Purchases	Order
customr1	5-5-2000	Cheese	2
customr1	5-5-2000	Ice Cream	1
customr1	5-5-2000	potatoes	1
customr2	12-4-2020	water	2
customr2	12-4-2020	Milk	2
customr2	12-9-2021	Banana	2
customr2	12-9-2021	lime	1
customr1	5-5-2000	watermelon	2
customr2	12-4-2020	Cheese	1

    df = pd.read_excel(file)
df.head()

pivot_df = df.pivot_table(index=['ID','Date'],columns='Order',values='Purchases',aggfunc=pd.unique,fill_value=None, margins=True,dropna=True,margins_name='All',observed=True).drop(columns['All']).drop(labels=['All'], axis=0).rename(columns={1:'order 1', 2: 'order2'})

pivot_df.head()

the result is a list of list

ID	Date	order 1	order2

splits = [pd.DataFrame(pivot_df['order1'].tolist()).add_prefix(col) for col in pivot_df.columns]

clean_df = pd.concat(splits, axis=1)

TypeError: 'float' object is not iterable

pivot_df['first order'] = (i for i in pivot_df['order1'])

TypeError: object of type 'generator' has no len()

CodePudding user response：

IIUC, you can use melt:

out = (df.melt(['ID', 'Date'], var_name='Order', value_name='Purchases')
         .replace({'^first.*': 1, '^second.*': 2}, regex=True).dropna())
print(out)

# Output
          ID       Date  Order   Purchases
0   customr1   5-5-2000      1   Ice Cream
1   customr2  12-4-2020      1      Cheese
2   customr2  12-9-2021      1        lime
3   customr1   5-5-2000      1    potatoes
6   customr1   5-5-2000      2      Cheese
7   customr2  12-4-2020      2       water
8   customr2  12-9-2021      2      Banana
9   customr1   5-5-2000      2  watermelon
10  customr2  12-4-2020      2        Milk

CodePudding user response：

Here is a way to do what your question asks:

df2 = df.groupby(['ID', 'Date', 'Order']).agg(list).unstack('Order').reset_index()
df2.columns = ['ID', 'Date', 'first order', 'second order']
df2 = pd.concat([df2[['ID','Date']]]  
    ([pd.DataFrame(
        df2.iloc[:,i 2].tolist(), 
        columns=[f'{["first","second"][i]} order {j 1}' 
            for j in range(max(df2.iloc[:,i 2].str.len()))]).fillna(np.nan) 
        for i in range(2)
    ]), axis=1)

Input:

         ID       Date   Purchases  Order
0  customr1   5-5-2000      Cheese      2
1  customr1   5-5-2000   Ice Cream      1
2  customr1   5-5-2000    potatoes      1
3  customr2  12-4-2020       water      2
4  customr2  12-4-2020        Milk      2
5  customr2  12-9-2021      Banana      2
6  customr2  12-9-2021        lime      1
7  customr1   5-5-2000  watermelon      2
8  customr2  12-4-2020      Cheese      1

Output:

         ID       Date first order 1 first order 2 second order 1 second order 2
0  customr1   5-5-2000     Ice Cream      potatoes         Cheese     watermelon
1  customr2  12-4-2020        Cheese           NaN          water           Milk
2  customr2  12-9-2021          lime           NaN         Banana            NaN

Explanation:

Use groupby() and agg(list) to create values that are lists containing up to 2 items per order
Use unstack() to pivot these lists into one column per Order
Use reset_index() to restore ID and Date from index levels to columns
Update the columns of the dataframe to be an index (not multiindex)
In a list comprehension, iterate over orders (1 and 2, a.k.a. first and second) to:
- use tolist() to convert the corresponding column to a list of lists
- use a list comprehension to iterate over range(2) (the max items per Order) and create column names first order 1, first order 2
- use DataFrame() to create a 2-column dataframe for i'th Order
Use concat() to horizontally glue together the ID, Date columns with the list of 2-column Order dataframes.

For a more general solution than can handle arbitrary numbers of orders and items per order, you can do this:

maxOrdinal = df.Order.max()
ordinalsUpto10 = ['zeroth', 'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth']
ordinals = [ordinalsUpto10[i] if i < 11 else f'{i}th' for i in sorted(df.Order.unique())]
df2 = df.groupby(['ID', 'Date', 'Order']).agg(list).unstack('Order').reset_index().fillna('')
df2.columns = ['ID', 'Date']   [f'{nth} order' for nth in ordinals]
df2 = pd.concat([df2[['ID','Date']]]  
    ([pd.DataFrame(
        df2.iloc[:,i 2].tolist(), 
        columns=[f'{ordinal} order {j 1}' 
            for j in range(max(df2.iloc[:,i 2].str.len()))]).fillna(np.nan) 
        for i, ordinal in enumerate(ordinals)
    ]), axis=1)

Input:

          ID       Date     Purchases  Order
0   customr1   5-5-2000        Cheese      2
1   customr1   5-5-2000     Ice Cream      1
2   customr1   5-5-2000      potatoes      1
3   customr2  12-4-2020         water      2
4   customr2  12-4-2020          Milk      2
5   customr2  12-9-2021        Banana      2
6   customr2  12-9-2021          lime      1
7   customr1   5-5-2000    watermelon      2
8   customr2  12-4-2020        Cheese      1
9   customr1   5-5-2000      lettuces      3
10  customr1   5-5-2000  French beans      3
11  customr1   5-5-2000      radishes      3

Output:

         ID       Date first order 1 first order 2 second order 1 second order 2 third order 1 third order 2 third order 3
0  customr1   5-5-2000     Ice Cream      potatoes         Cheese     watermelon      lettuces  French beans      radishes
1  customr2  12-4-2020        Cheese           NaN          water           Milk           NaN           NaN           NaN
2  customr2  12-9-2021          lime           NaN         Banana            NaN           NaN           NaN           NaN