I have a dataframe listing the different LEGO pieces contained in each of my LEGO set boxes. Each set box always has many different regular pieces, but sometimes it also contains some additional spare pieces, so the dataframe has a boolean column to distinguish the two cases.
Now I want to summarize the dataset so I get just one row per LEGO set (groupby set_id) with a new column for the total amount of pieces in that set box (aggregated sum of "quantity").
My problem is that I also want two additional columns counting how many of those pieces are "regular" and how many are "spare", based on the True/False column.
Is there any way of calculating those three sum columns by creating just one additional dataframe with a single .agg() call?
This is my current approach, which creates three dataframes and merges their columns:
import pandas as pd
import random
random.seed(1)
# creating sample data:
nrows=15
df = pd.DataFrame({
    "set_id":   ["ABC"[random.randint(0, 2)] for _ in range(nrows)],
    "part_id":  [random.randint(1000, 8000) for _ in range(nrows)],
    "quantity": [random.randint(1, 10) for _ in range(nrows)],
    "is_spare": [random.random() > 0.75 for _ in range(nrows)],
})
print(df)
# grouping into a new dfsummary dataframe: HOW TO DO IT IN JUST ONE STEP ?
# aggregate sum of ALL pieces:
dfsummary = df.groupby("set_id", as_index=False) \
.agg(num_pieces=("quantity","sum"))
# aggregate sum of "normal" pieces:
dfsummary2 = df.loc[~df["is_spare"]].groupby("set_id", as_index=False) \
.agg(normal_pieces=("quantity","sum"))
# aggregate sum of "spare" pieces:
dfsummary3 = df.loc[df["is_spare"]].groupby("set_id", as_index=False) \
.agg(spare_pieces=("quantity","sum"))
# Putting all aggregate columns together:
dfsummary = dfsummary \
.merge(dfsummary2,on="set_id",how="left") \
.merge(dfsummary3,on="set_id",how="left")
print(dfsummary)
ORIGINAL DATA:
set_id part_id quantity is_spare
0 A 4545 1 False
1 C 5976 1 False
2 A 7244 9 False
3 B 7284 1 False
4 A 1017 7 False
5 B 6700 4 True
6 B 4648 7 False
7 B 3181 1 False
8 C 6910 9 False
9 B 7568 4 True
10 A 2874 8 True
11 A 5842 8 False
12 B 1837 9 False
13 A 3600 4 False
14 B 1250 6 False
SUMMARIZED DATA:
set_id num_pieces normal_pieces spare_pieces
0 A 37 29 8.0
1 B 32 24 8.0
2 C 10 10 NaN
I saw this Stack Overflow question, but my case is somewhat different because the sum() would only be applied to some rows of the target column, depending on another column's True/False values.
CodePudding user response:
You can do it in one line. The trick is to create a temporary column where quantity is negative for spare pieces and positive for normal pieces:
out = df.assign(qty=df['is_spare'].replace({True: -1, False: 1}) * df['quantity']) \
        .groupby('set_id')['qty'] \
        .agg(num_pieces=lambda x: x.abs().sum(),
             normal_pieces=lambda x: x[x > 0].sum(),
             spare_pieces=lambda x: -x[x < 0].sum()) \
        .reset_index()
Output:
>>> out
set_id num_pieces normal_pieces spare_pieces
0 A 37 29 8
1 B 32 24 8
2 C 10 10 0
>>> df['is_spare'].replace({True: -1, False: 1}) * df['quantity']
0 1 # normal_pieces
1 1
2 9
3 1
4 7
5 -4 # spare_pieces
6 7
7 1
8 9
9 -4
10 -8
11 8
12 9
13 4
14 6
dtype: int64
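A related sketch (my own variant, not taken from the question) that keeps the named-aggregation style the asker already uses: zero out the quantities belonging to the "other" category with Series.where, then a single .agg() call sums all three columns at once. The helper column names normal_qty and spare_qty are made up for illustration.

```python
import pandas as pd
import random

# rebuild the question's sample data
random.seed(1)
nrows = 15
df = pd.DataFrame({
    "set_id":   ["ABC"[random.randint(0, 2)] for _ in range(nrows)],
    "part_id":  [random.randint(1000, 8000) for _ in range(nrows)],
    "quantity": [random.randint(1, 10) for _ in range(nrows)],
    "is_spare": [random.random() > 0.75 for _ in range(nrows)],
})

# where() keeps quantity when the condition holds and writes 0 otherwise,
# so each helper column only counts one category; one .agg() then sums all three
out = (df.assign(normal_qty=df["quantity"].where(~df["is_spare"], 0),
                 spare_qty=df["quantity"].where(df["is_spare"], 0))
         .groupby("set_id", as_index=False)
         .agg(num_pieces=("quantity", "sum"),
              normal_pieces=("normal_qty", "sum"),
              spare_pieces=("spare_qty", "sum")))
print(out)
```

Unlike the unstack approach, this keeps integer dtypes and produces 0 (not NaN) for sets without spares.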
CodePudding user response:
One option is to do a groupby and unstack:
(df
.groupby(['set_id', 'is_spare'])
.quantity
.sum()
.unstack('is_spare')
.rename(columns={False:'normal_pieces', True:'spare_pieces'})
.assign(num_pieces = lambda df: df.sum(axis = 'columns'))
.rename_axis(columns=None)
.reset_index()
)
set_id normal_pieces spare_pieces num_pieces
0 A 29.0 8.0 37.0
1 B 24.0 8.0 32.0
2 C 10.0 NaN 10.0
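The same unstacked shape can also be reached with pivot_table; passing fill_value=0 avoids the NaN (and the resulting float columns) seen above. A minimal sketch on a small hypothetical frame with the same columns as the question's data:

```python
import pandas as pd

# hypothetical mini-frame: set C has no spare pieces
df = pd.DataFrame({
    "set_id":   ["A", "A", "B", "B", "C"],
    "quantity": [5, 3, 7, 2, 4],
    "is_spare": [False, True, False, True, False],
})

# pivot_table sums quantity per (set_id, is_spare) cell;
# fill_value=0 fills the missing C/True cell instead of leaving NaN
out = (df.pivot_table(index="set_id", columns="is_spare",
                      values="quantity", aggfunc="sum", fill_value=0)
         .rename(columns={False: "normal_pieces", True: "spare_pieces"})
         .rename_axis(columns=None)
         .reset_index())
out["num_pieces"] = out["normal_pieces"] + out["spare_pieces"]
print(out)
```

Note that the rename only works if both True and False actually occur somewhere in the data; a dataset with no spares at all would yield only a normal_pieces column.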