Perform a cumulative count of matching values within a column based on multiple conditions and a group


I have a df of different merchants, products and popularity rankings, where the groups are uneven in size.

Merchant Product Rank
Merchant 1 apple 1
Merchant 1 banana 2
Merchant 1 mango 3
Merchant 1 grapes 4
Merchant 1 kiwi 5
Merchant 2 orange 1
Merchant 2 apple 2
Merchant 2 banana 3
Merchant 2 grapes 4
Merchant 3 mango 1
Merchant 3 grapes 2
Merchant 3 orange 3

I want to count the cumulative number of matching products per merchant, then return an adjusted rank (which is the cumulative total + 1). But I only want to count the first run of matching products, and only if it starts at rank 1, with the adjusted rank resetting to 1 if the total is not > 2.

In this example the matching pattern would be apple|banana|orange

The ideal output would be a dataframe with the merchant, rank and adjusted rank:

Merchant Rank Adjusted Rank
Merchant 1 1 3
Merchant 2 1 4
Merchant 3 1 1
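For anyone who wants to run the examples below without the CSV, the table above can be rebuilt inline (the contents of data.csv are assumed to match the table):

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    'Merchant': ['Merchant 1'] * 5 + ['Merchant 2'] * 4 + ['Merchant 3'] * 3,
    'Product': ['apple', 'banana', 'mango', 'grapes', 'kiwi',
                'orange', 'apple', 'banana', 'grapes',
                'mango', 'grapes', 'orange'],
    'Rank': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
})
```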

With some help from this post, I've managed to solve the problem. But what I've done feels quite inefficient. Is there a better way to accomplish this?

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

pattern = 'apple|banana|orange'

# Check whether Product contains the pattern and create a boolean column
df['Match'] = df['Product'].str.contains(pattern)

# perform count on True values, resetting when False 
df['Count'] = df.groupby(df['Match'].astype(int).diff().ne(0).cumsum())['Match'].cumsum()

# filter out False values

filtered = df[df['Match']]

# filter out values where rank != count, as these aren't in the 1st grouping

filtered = filtered.loc[filtered['Rank'] == filtered['Count']]

# get max rank from remaining values

filtered = filtered.groupby('Merchant')['Count'].max().reset_index(name='Adjusted Rank')

# add 1 to ranks to get maximum position, reset ranking to 1 if there aren't 2 concurrent values

filtered['Adjusted Rank'] = np.where(filtered['Adjusted Rank'] > 2, (filtered['Adjusted Rank'] + 1), 1)

# merge dfs, drop columns, drop dupes

df = df.merge(filtered, how='left', on='Merchant').fillna(1)

df = df[['Merchant','Rank','Adjusted Rank']]

df.drop_duplicates('Merchant', inplace=True)
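The diff().ne(0).cumsum() step above labels each consecutive run of equal values with its own id, so the cumulative count restarts at every change from True to False (or back). A minimal standalone sketch of the idiom, using a made-up boolean series:

```python
import pandas as pd

# Made-up boolean series with alternating runs of True/False
s = pd.Series([True, True, False, True, True, True, False])

# Every change of value starts a new run id
run_id = s.astype(int).diff().ne(0).cumsum()

# Cumulative count within each run; False rows contribute 0
count = s.astype(int).groupby(run_id).cumsum()
print(count.tolist())  # [1, 2, 0, 1, 2, 3, 0]
```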

Thanks for any help!

CodePudding user response:

You can use:

products = ['apple', 'banana', 'orange']

out = (df
 .groupby('Merchant', as_index=False)
 .agg(**{'Adjusted Rank': ('Product', lambda s: s.isin(products).cummin().sum() + 1)})
)

output:

     Merchant  Adjusted Rank
0  Merchant 1              3
1  Merchant 2              4
2  Merchant 3              1

And to reset the rank to 1 if the Adjusted Rank is <= 2:

out['Rank'] = out['Adjusted Rank'].where(out['Adjusted Rank'].gt(2), 1)
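For completeness, here is the whole answer as a self-contained sketch. The frame is rebuilt from the question's table (an assumption, since data.csv isn't shown), and both steps are combined:

```python
import pandas as pd

# Rebuild the sample data from the question's table
df = pd.DataFrame({
    'Merchant': ['Merchant 1'] * 5 + ['Merchant 2'] * 4 + ['Merchant 3'] * 3,
    'Product': ['apple', 'banana', 'mango', 'grapes', 'kiwi',
                'orange', 'apple', 'banana', 'grapes',
                'mango', 'grapes', 'orange'],
    'Rank': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2, 3],
})

products = ['apple', 'banana', 'orange']

# cummin() on the boolean isin mask stays True only for the leading run of
# matches, so the sum is the length of that run; +1 gives the adjusted rank
out = (df
 .groupby('Merchant', as_index=False)
 .agg(**{'Adjusted Rank': ('Product',
                           lambda s: s.isin(products).cummin().sum() + 1)})
)

# Reset to 1 when the adjusted rank is not greater than 2
out['Rank'] = out['Adjusted Rank'].where(out['Adjusted Rank'].gt(2), 1)
print(out)
```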