How do I filter data in one series using another series as the filter?-CodePudding

I have a dataframe that contains sales data. It looks something like this:

import pandas as pd
df = pd.DataFrame({'order_id': ['A1', 'A2', 'A3', 'A4', 'A5'], 
                   'customer_id': ['C1', 'C2', 'C3', 'C4', 'C5'], 
                   'store': ['Hardware1', 'Grocery3', 'Beauty5', 'Pet2', 'Electronics4'],
                   'price': [20.59, 38.97, 56.84, 89.88, 156.64],
                   'rating': [5, 4, 3,'NA',4]})

I am looking to give a promotional offer to stores that meet the following conditions:

The store must have more than 30 ratings in the dataframe
The store must have an average rating greater than 4

Once both conditions are met, I want to return the stores that met the above two conditions, so I know which stores could receive promotional offers.

I'm stumped at what's the best way to break down the data to accomplish this. I was thinking of starting with creating a subset of the dataframe with the data I need, which would look like:

promo = df[['store', 'rating']]

After that, I'm not sure what's best to do. I'm not sure if I should create a function that will determine an average rating and use the function with .apply() method on 'store'. However, I am not sure if a function makes sense since I don't know how to account for the store when determining average ratings. I was thinking of:

promo.groupby('store')['rating']

However, until I clean 'rating' to deal with or ignore NA values, I don't know if that makes sense either. I also thought about using .where() however, I don't know what I would define as a filter to apply to the 'store' series.

Any thoughts would be appreciated.

CodePudding user response：

You can use GroupBy.count to count number ratings for each group and GroupBy.mean. Pandas has GroupBy.agg to aggregate the data. We use count and mean to aggregate data in each group.

#Convert 'NA' to NaN
df['rating'] = df['rating'].replace('NA', np.nan) #dtype is float here.
# To maintain ints we have to use `.astype('Int64') which supports Nullable int.
# df['rating'] = df['rating'].replace('NA', np.nan).astype('Int64') # Capital I.

# `count` doesn't include NA values.
stores = df.groupby('store')['rating'].agg(('count', 'mean')).reset_index()
m = stores['count'].gt(30) & stores['mean'].gt(4) # Stores with more than 30
                                                  # rating and avg. rating > 4
out = stores.loc[m, "store"]