I have a dataframe 'df', with the following structure:
Input:
ID | Product | Price |
---|---|---|
1 | P1 | 10 |
2 | P1 | 11 |
3 | P2 | 12 |
4 | P2 | 12 |
5 | P2 | 15 |
Expected Output:
ID | Product | Price | Distinct_Running_Count |
---|---|---|---|
1 | P1 | 10 | 1 |
2 | P1 | 11 | 2 |
3 | P2 | 12 | 1 |
4 | P2 | 12 | 1 |
5 | P2 | 15 | 2 |
Problem:
I want to create a new column called 'Distinct_Running_Count' with the following logic:
- Within each 'Product', keep a running count of distinct 'Price' values
- A product with no price change keeps 'Distinct_Running_Count' at 1
- Every subsequent price change increments 'Distinct_Running_Count'
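For reference, the sample frame from the question can be reproduced like this (a minimal sketch; the column names follow the tables above):

```python
import pandas as pd

# Input data as shown in the question's table
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Product": ["P1", "P1", "P2", "P2", "P2"],
    "Price": [10, 11, 12, 12, 15],
})
```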
Solutions Tried:
df['Distinct_Running_Count'] = df.groupby(['Product', 'Price']).cumcount() + 1
df['Distinct_Running_Count'] = df.groupby('Product')['Price'].transform('nunique')
Issue:
The above solutions return either a plain running count or the total number of unique prices per product, but not the expected result.
CodePudding user response:
You can compare each row with the previous row in the Price
column within each product group and take the cumulative sum of the changes:
df['Distinct_Running_Count'] = (
    df.groupby('Product')['Price']
      # Flag rows where the price differs from the previous row in the group.
      # fillna(col) makes the first row of each group compare equal to itself,
      # so it is not flagged; cumsum the flags and add 1 to start counting at 1.
      .transform(lambda col: col.ne(col.shift().fillna(col)).cumsum().add(1))
)
print(df)
ID Product Price Distinct_Running_Count
0 1 P1 10 1
1 2 P1 11 2
2 3 P2 12 1
3 4 P2 12 1
4 5 P2 15 2
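As a side note, the same result can be reached with a slightly shorter transform: `ne` against the NaN produced by `shift` already evaluates to True on the first row of each group, so the cumulative sum starts at 1 without the `fillna`/`add(1)` steps (a sketch, assuming the same sample frame):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Product": ["P1", "P1", "P2", "P2", "P2"],
    "Price": [10, 11, 12, 12, 15],
})

# The first row of each group compares against NaN, which counts as a
# change, so the cumsum itself starts at 1 with no extra adjustment.
df["Distinct_Running_Count"] = (
    df.groupby("Product")["Price"]
      .transform(lambda c: c.ne(c.shift()).cumsum())
)
```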