I have a dataframe that contains sales data. It looks something like this:
import pandas as pd
df = pd.DataFrame({'order_id': ['A1', 'A2', 'A3', 'A4', 'A5'],
'customer_id': ['C1', 'C2', 'C3', 'C4', 'C5'],
'store': ['Hardware1', 'Grocery3', 'Beauty5', 'Pet2', 'Electronics4'],
'price': [20.59, 38.97, 56.84, 89.88, 156.64],
'rating': [5, 4, 3,'NA',4]})
I am looking to give a promotional offer to stores that meet the following conditions:
- The store must have more than 30 ratings in the dataframe
- The store must have an average rating greater than 4
Once both conditions are met, I want to return the stores that met the above two conditions, so I know which stores could receive promotional offers.
I'm stumped at what's the best way to break down the data to accomplish this. I was thinking of starting with creating a subset of the dataframe with the data I need, which would look like:
promo = df[['store', 'rating']]
After that, I'm not sure what's best to do. I'm not sure if I should create a function that will determine an average rating and use the function with .apply() method on 'store'. However, I am not sure if a function makes sense since I don't know how to account for the store when determining average ratings. I was thinking of:
promo.groupby('store')['rating']
However, until I clean 'rating' to deal with or ignore NA values, I don't know if that makes sense either. I also thought about using .where() however, I don't know what I would define as a filter to apply to the 'store' series.
Any thoughts would be appreciated.
CodePudding user response:
You can use GroupBy.count
to count number ratings for each group and GroupBy.mean
. Pandas has GroupBy.agg
to aggregate the data. We use count
and mean
to aggregate data in each group.
#Convert 'NA' to NaN
df['rating'] = df['rating'].replace('NA', np.nan) #dtype is float here.
# To maintain ints we have to use `.astype('Int64') which supports Nullable int.
# df['rating'] = df['rating'].replace('NA', np.nan).astype('Int64') # Capital I.
# `count` doesn't include NA values.
stores = df.groupby('store')['rating'].agg(('count', 'mean')).reset_index()
m = stores['count'].gt(30) & stores['mean'].gt(4) # Stores with more than 30
# rating and avg. rating > 4
out = stores.loc[m, "store"]