I have a DataFrame of different merchants, products and a popularity ranking; the groups are uneven in size.
Merchant | Product | Rank |
---|---|---|
Merchant 1 | apple | 1 |
Merchant 1 | banana | 2 |
Merchant 1 | mango | 3 |
Merchant 1 | grapes | 4 |
Merchant 1 | kiwi | 5 |
Merchant 2 | orange | 1 |
Merchant 2 | apple | 2 |
Merchant 2 | banana | 3 |
Merchant 2 | grapes | 4 |
Merchant 3 | mango | 1 |
Merchant 3 | grapes | 2 |
Merchant 3 | orange | 3 |
I want to count the number of cumulative matching products per merchant, then return an adjusted rank (which is the cumulative total + 1). But I only want to count the first run of matching products, and only if it starts at rank 1, with the adjusted rank resetting to 1 if the total is not > 2.
In this example, the matching pattern would be `apple|banana|orange`.
The ideal output would be a dataframe with the merchant, rank and adjusted rank:
Merchant | Rank | Adjusted Rank |
---|---|---|
Merchant 1 | 1 | 3 |
Merchant 2 | 1 | 4 |
Merchant 3 | 1 | 1 |
With some help from this post, I've managed to solve the problem, but what I've done feels quite inefficient. Is there a better way to accomplish this?
```python
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
pattern = 'apple|banana|orange'
# Check whether the column contains the substring and create a bool column
df['Match'] = np.where(df['Product'].str.contains(pattern), True, False)
# Running count of True values, resetting when False
df['Count'] = df.groupby(df['Match'].astype(int).diff().ne(0).cumsum())['Match'].cumsum()
# Filter out False values
filtered = df[df['Match'] != False]
# Filter out rows where rank != count, as these aren't in the first grouping
filtered = filtered.loc[filtered['Rank'] == filtered['Count']]
# Get the max count from the remaining rows
filtered = filtered.groupby('Merchant')['Count'].max().reset_index(name='Adjusted Rank')
# Add 1 to get the adjusted rank; reset to 1 if there aren't more than 2 consecutive matches
filtered['Adjusted Rank'] = np.where(filtered['Adjusted Rank'] > 2, filtered['Adjusted Rank'] + 1, 1)
# Merge the frames, drop columns, drop duplicates
df = df.merge(filtered, how='left', on='Merchant').fillna(1)
df = df[['Merchant', 'Rank', 'Adjusted Rank']]
df.drop_duplicates('Merchant', inplace=True)
```
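For anyone wanting to reproduce this, the example data can be built inline instead of reading `data.csv` (this frame is a hypothetical stand-in matching the table above):

```python
import pandas as pd

# Hypothetical inline stand-in for data.csv, matching the example table
df = pd.DataFrame({
    'Merchant': ['Merchant 1'] * 5 + ['Merchant 2'] * 4 + ['Merchant 3'] * 3,
    'Product': ['apple', 'banana', 'mango', 'grapes', 'kiwi',
                'orange', 'apple', 'banana', 'grapes',
                'mango', 'grapes', 'orange'],
    'Rank': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
})
```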
Thanks for any help!
CodePudding user response:
You can use:
```python
products = ['apple', 'banana', 'orange']
out = (df
   .groupby('Merchant', as_index=False)
   .agg(**{'Adjusted Rank': ('Product', lambda s: s.isin(products).cummin().sum() + 1)})
)
```
output:

```
     Merchant  Adjusted Rank
0  Merchant 1              3
1  Merchant 2              4
2  Merchant 3              1
```
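The trick here is `cummin` on the boolean mask: it stays `True` only for the leading run of matches (starting from the first row), and flips to `False` permanently at the first non-match. A minimal standalone illustration on Merchant 1's products:

```python
import pandas as pd

products = ['apple', 'banana', 'orange']
s = pd.Series(['apple', 'banana', 'mango', 'grapes', 'kiwi'])  # Merchant 1

mask = s.isin(products)   # True, True, False, False, False
streak = mask.cummin()    # True only while every row so far has matched
adjusted = streak.sum() + 1
print(adjusted)           # 3 -> the adjusted rank for Merchant 1
```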
And to reset the rank to 1 when the Adjusted Rank is not greater than 2:

```python
out['Rank'] = out['Adjusted Rank'].where(out['Adjusted Rank'].gt(2), 1)
```
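`Series.where(cond, other)` keeps the original value where the condition is `True` and substitutes `other` elsewhere, so values of 2 or less collapse to 1. A small illustration (standalone, with a made-up series):

```python
import pandas as pd

# Keep values greater than 2, otherwise reset to 1
s = pd.Series([3, 4, 2])
print(s.where(s.gt(2), 1).tolist())  # [3, 4, 1]
```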