Hi I'm working with a large table where I need to pivot one column into multiple columns but instead I got columns as a list of lists instead. I tried to convert a list of lists into columns however I got errors.
the table should be like this "preferred result"
ID | Date | first order 1 | first Order 2 | second order1 | second order 2 |
---|---|---|---|---|---|
customr1 | 5-5-2000 | Ice Cream | potatoes | Cheese | watermelon |
customr2 | 12-4-2020 | Cheese | NA | water | Milk |
customr2 | 12-9-2021 | lime | NA | Banana | NA |
I've tried these solutions
ID | Date | Purchases | Order |
---|---|---|---|
customr1 | 5-5-2000 | Cheese | 2 |
customr1 | 5-5-2000 | Ice Cream | 1 |
customr1 | 5-5-2000 | potatoes | 1 |
customr2 | 12-4-2020 | water | 2 |
customr2 | 12-4-2020 | Milk | 2 |
customr2 | 12-9-2021 | Banana | 2 |
customr2 | 12-9-2021 | lime | 1 |
customr1 | 5-5-2000 | watermelon | 2 |
customr2 | 12-4-2020 | Cheese | 1 |
df = pd.read_excel(file)
df.head()
pivot_df = df.pivot_table(index=['ID','Date'],columns='Order',values='Purchases',aggfunc=pd.unique,fill_value=None, margins=True,dropna=True,margins_name='All',observed=True).drop(columns['All']).drop(labels=['All'], axis=0).rename(columns={1:'order 1', 2: 'order2'})
pivot_df.head()
the result is a list of list
ID | Date | order 1 | order2 |
---|
|customr1|5-5-2000| [Ice Cream,potatoes] | [Cheese, watermelon] |
|customr2|12-4-2020|[Cheese] | [water, Milk] |
|customr2|12-9-2021| [lime] | [Banana] |
splits = [pd.DataFrame(pivot_df['order1'].tolist()).add_prefix(col) for col in pivot_df.columns]
clean_df = pd.concat(splits, axis=1)
TypeError: 'float' object is not iterable
pivot_df['first order'] = (i for i in pivot_df['order1'])
TypeError: object of type 'generator' has no len()
CodePudding user response:
IIUC, you can use melt
:
out = (df.melt(['ID', 'Date'], var_name='Order', value_name='Purchases')
.replace({'^first.*': 1, '^second.*': 2}, regex=True).dropna())
print(out)
# Output
ID Date Order Purchases
0 customr1 5-5-2000 1 Ice Cream
1 customr2 12-4-2020 1 Cheese
2 customr2 12-9-2021 1 lime
3 customr1 5-5-2000 1 potatoes
6 customr1 5-5-2000 2 Cheese
7 customr2 12-4-2020 2 water
8 customr2 12-9-2021 2 Banana
9 customr1 5-5-2000 2 watermelon
10 customr2 12-4-2020 2 Milk
CodePudding user response:
Here is a way to do what your question asks:
df2 = df.groupby(['ID', 'Date', 'Order']).agg(list).unstack('Order').reset_index()
df2.columns = ['ID', 'Date', 'first order', 'second order']
df2 = pd.concat([df2[['ID','Date']]]
([pd.DataFrame(
df2.iloc[:,i 2].tolist(),
columns=[f'{["first","second"][i]} order {j 1}'
for j in range(max(df2.iloc[:,i 2].str.len()))]).fillna(np.nan)
for i in range(2)
]), axis=1)
Input:
ID Date Purchases Order
0 customr1 5-5-2000 Cheese 2
1 customr1 5-5-2000 Ice Cream 1
2 customr1 5-5-2000 potatoes 1
3 customr2 12-4-2020 water 2
4 customr2 12-4-2020 Milk 2
5 customr2 12-9-2021 Banana 2
6 customr2 12-9-2021 lime 1
7 customr1 5-5-2000 watermelon 2
8 customr2 12-4-2020 Cheese 1
Output:
ID Date first order 1 first order 2 second order 1 second order 2
0 customr1 5-5-2000 Ice Cream potatoes Cheese watermelon
1 customr2 12-4-2020 Cheese NaN water Milk
2 customr2 12-9-2021 lime NaN Banana NaN
Explanation:
- Use
groupby()
andagg(list)
to create values that are lists containing up to 2 items per order - Use
unstack()
to pivot these lists into one column perOrder
- Use
reset_index()
to restoreID
andDate
from index levels to columns - Update the columns of the dataframe to be an index (not multiindex)
- In a list comprehension, iterate over orders (1 and 2, a.k.a.
first
andsecond
) to:- use
tolist()
to convert the corresponding column to a list of lists - use a list comprehension to iterate over
range(2)
(the max items perOrder
) and create column namesfirst order 1, first order 2
- use
DataFrame()
to create a 2-column dataframe fori
'thOrder
- use
- Use
concat()
to horizontally glue together theID, Date
columns with the list of 2-columnOrder
dataframes.
For a more general solution than can handle arbitrary numbers of orders and items per order, you can do this:
maxOrdinal = df.Order.max()
ordinalsUpto10 = ['zeroth', 'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth']
ordinals = [ordinalsUpto10[i] if i < 11 else f'{i}th' for i in sorted(df.Order.unique())]
df2 = df.groupby(['ID', 'Date', 'Order']).agg(list).unstack('Order').reset_index().fillna('')
df2.columns = ['ID', 'Date'] [f'{nth} order' for nth in ordinals]
df2 = pd.concat([df2[['ID','Date']]]
([pd.DataFrame(
df2.iloc[:,i 2].tolist(),
columns=[f'{ordinal} order {j 1}'
for j in range(max(df2.iloc[:,i 2].str.len()))]).fillna(np.nan)
for i, ordinal in enumerate(ordinals)
]), axis=1)
Input:
ID Date Purchases Order
0 customr1 5-5-2000 Cheese 2
1 customr1 5-5-2000 Ice Cream 1
2 customr1 5-5-2000 potatoes 1
3 customr2 12-4-2020 water 2
4 customr2 12-4-2020 Milk 2
5 customr2 12-9-2021 Banana 2
6 customr2 12-9-2021 lime 1
7 customr1 5-5-2000 watermelon 2
8 customr2 12-4-2020 Cheese 1
9 customr1 5-5-2000 lettuces 3
10 customr1 5-5-2000 French beans 3
11 customr1 5-5-2000 radishes 3
Output:
ID Date first order 1 first order 2 second order 1 second order 2 third order 1 third order 2 third order 3
0 customr1 5-5-2000 Ice Cream potatoes Cheese watermelon lettuces French beans radishes
1 customr2 12-4-2020 Cheese NaN water Milk NaN NaN NaN
2 customr2 12-9-2021 lime NaN Banana NaN NaN NaN NaN