*Edit: Posted code as dataframe not sheets link.
I have a large data set consisting of ~6.5M rows and 6 columns. The rows are BrandId's (e.g., 01-00058) associated with unique items and the 3 columns I need utilized are: BrandId, InventoryDate, and OnHand.
BrandID SalesPrice InventoryDate Size OnHand PurchasePrice
0 01-00058 9.28 2018-06-30 750mL 6 6.77
1 01-00058 9.28 2018-07-01 750mL 6 6.77
2 01-00058 9.28 2018-07-02 750mL 6 6.77
3 01-00058 9.28 2018-07-03 750mL 102 6.77
4 01-00058 9.28 2018-07-04 750mL 96 6.77
... ... ... ... ... ...
6531265 02-90631 12.74 2019-06-26 400mL 60 8.49
6531266 02-90631 12.74 2019-06-27 400mL 60 8.49
6531267 02-90631 12.74 2019-06-28 400mL 60 8.49
6531268 02-90631 12.74 2019-06-29 400mL 60 8.49
6531269 02-90631 12.74 2019-06-30 400mL 60 8.49
[6531270 rows x 6 columns]
I would like to determine how many days each particular BrandId has no inventory on hand. For example, BrandId 01-00058 has 27 unique days where OnHand = 0. I would like summarize that information for all unique BrandId's.
I would then like to find the mean and standard deviation of these unique BrandId's from the days each is stocked out.
Ideally, I would love this information to be viewed in the variable explorer as a table that reads:
BrandID Sum OnHand = 0
01-00058 27
01-00061 39
01-00062 14
``'
CodePudding user response:
IIUC, try with groupby
:
>>> df[df["OnHand"].eq(0)].groupby("BrandID")["InventoryDate"].nunique()