I think there is a fairly simple solution to this, but I just can't wrap my head around it right now.
I have a data frame with several hundred thousand orders. I am trying to find the nsmallest-margin products by sub-category, while filtering out products that have been sold fewer than 2 separate times, e.g. if a product has only been involved in 1 transaction, I don't want to include it in my output.
I would like my final output to be a frame of the nsmallest-margin products where 'Sub-Category' == 'Appliances' and the 'Product Name' value count is > 2.
The portion I'm getting stuck on is the 'Product Name' value count. I just can't get it to work.
I've tried value_counts() as below:
df[(df['Sub-Category'] == 'Appliances') & (df.value_counts('Product Name') > 2)]\
.groupby('Product Name', as_index = False)['margin'].mean().nsmallest(20,'margin')
I've also tried groupby().count():
df[(df['Sub-Category'] == 'Appliances') & (df.groupby('Product Name').count() > 2)]\
.groupby('Product Name', as_index = False)['margin'].mean().nsmallest(20,'margin')
I think I know the problem: I'm filtering an aggregated frame against a non-aggregated frame, but I just cannot think of how to check and filter the frequency of a 'Product Name'. I'm kind of brain dead at this point.
Any suggestions?
Thanks in advance.
CodePudding user response:
"I think the problem is that I'm filtering an aggregated frame and a non-aggregated frame."
You are right: value_counts() returns one row per product, so its boolean result doesn't align with the rows of the original frame. Use map to replace each product name with its value count, which keeps the mask row-aligned.
Replace:
(df.value_counts('Product Name') > 2)
By:
(df['Product Name'].map(df.value_counts('Product Name')) > 2)
# OR
(df.groupby('Product Name')['Product Name'].transform('size') > 2)
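Here is a minimal, self-contained sketch of the full pipeline on made-up data (the column names 'Sub-Category', 'Product Name', and 'margin' are taken from the question; the frame itself is invented for illustration):

```python
import pandas as pd

# Toy frame: 'Toaster' appears 3 times, 'Kettle' twice, the rest once.
df = pd.DataFrame({
    'Sub-Category': ['Appliances'] * 7,
    'Product Name': ['Toaster', 'Toaster', 'Toaster',
                     'Kettle', 'Kettle', 'Blender', 'Mixer'],
    'margin': [1.0, 2.0, 3.0, 5.0, 7.0, 4.0, 9.0],
})

# transform('size') broadcasts each product's row count back onto every
# row, so the boolean mask has the same index as df.
mask = (
    (df['Sub-Category'] == 'Appliances')
    & (df.groupby('Product Name')['Product Name'].transform('size') > 2)
)

result = (
    df[mask]
    .groupby('Product Name', as_index=False)['margin']
    .mean()
    .nsmallest(20, 'margin')
)
print(result)  # only 'Toaster' passes the > 2 filter, mean margin 2.0
```

The map variant behaves identically: `df['Product Name'].map(df.value_counts('Product Name'))` looks up each row's product in the counts Series, producing the same row-aligned numbers as transform('size').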