Home > Enterprise >  Running Distinct Count in Pandas by a group
Running Distinct Count in Pandas by a group

Time:08-22

I have a dataframe 'df', with the following structure:

Input:

ID Product Price
1 P1 10
2 P1 11
3 P2 12
4 P2 12
5 P2 15

Expected Output:

ID Product Price Distinct_Running_Count
1 P1 10 1
2 P1 11 2
3 P2 12 1
4 P2 12 1
5 P2 15 2

Problem:

I want to create a new column called 'Distinct_Running_Count', with the following logic:

  • Perform a running distinct count of a column 'Product' based on price
  • Some products don't have any price change, thus 'Distinct_Running_Count' will be 1
  • Every subsequent price change, the 'Distinct_Running_Count' will be incremented

Solutions Tried:

df['Distinct_Running_Count'] = df.groupby(['Product', 'Price']).cumcount()   1
df['Distinct_Running_Count'] = df.groupby(['Product', 'Price']).transform('nunique')

Issue:

The above solution either provides running count or the total uniques counts but not what I expect

CodePudding user response:

You can try to compare the row and next row in Price column and calculate the cumsum

df['Distinct_Running_Count'] = (df.groupby(['Product'])['Price']
                                .transform(lambda col: col.ne(col.shift().fillna(col)).cumsum().add(1)))
print(df)

   ID Product  Price  Distinct_Running_Count
0   1      P1     10                       1
1   2      P1     11                       2
2   3      P2     12                       1
3   4      P2     12                       1
4   5      P2     15                       2
  • Related