I have dataframe that contains Several Columns: Customerid, OrderNumber, PartNumber,Description,Revenue. Something like this:
CustomerId OrderNumber PartNumber Description Revenue
0 6512345 1 ABC1 KitKat 2
1 6512345 1 ABC2 Chips 3
2 6512345 1 ABC3 Coke 2
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
5 3242342 4 DCA1 Mentos 4
6 3242342 4 DCA2 Fanta 2
7 8768655 5 ABC1 KitKat 2
What I need is to filter this dataframe to contain only orders where the number of item for that order is one. So in this case those would be order with OrderNumber 2, 3 and 5 since these are the one with single transaction.
So since I have few rows of data and i know that here i need to exclude OrderNumbers 1 and 4 i can use this function:
sample = pd.read_excel('../data/Sample.xlsx')
sample = sample[(sample.OrderNumber != 1) & (sample.OrderNumber != 4)]
sample
Which gives me the output like this:
CustomerId OrderNumber PartNumber Description Revenue
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
7 8768655 5 ABC1 KitKat 2
Now i have over 100k rows and i need to be able to create funciton that will for each order check how many items is related to that order and then exclude the orders that have more than one item.
How would i do that?
CodePudding user response:
One quick way is to use groupby.filter
:
df.groupby('OrderNumber').filter(lambda g: len(g) == 1)
CustomerId OrderNumber PartNumber Description Revenue
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
7 8768655 5 ABC1 KitKat 2
Or filter by transform('size')
:
df[df.CustomerId.groupby(df.OrderNumber).transform('size') == 1]
CustomerId OrderNumber PartNumber Description Revenue
3 4213124 2 GFA1 Sprite 2
4 2132142 3 FGA1 Beer 3
7 8768655 5 ABC1 KitKat 2