I want to count the number of unique rows in a pandas DataFrame and add a new column, count_index, as in the example below. Put another way, I want duplicate rows to share the same index value.
import pandas as pd
df = {'A': [8, 8, 9, 9, 9, 12, 12, 13, 15, 15, 15],
      'B': [1, 1, 2, 2, 2, 11, 11, 3, 4, 4, 4],
      'C': [10, 10, 20, 20, 20, 101, 101, 30, 40, 40, 40],
      'D': [81, 81, 92, 92, 92, 121, 121, 134, 150, 150, 150]}
df = pd.DataFrame(df)
print(df.groupby(['A','B','C','D']).size())
#####################################################
#input
A B C D
8 1 10 81
8 1 10 81
9 2 20 92
9 2 20 92
9 2 20 92
12 11 101 121
12 11 101 121
13 3 30 134
15 4 40 150
15 4 40 150
15 4 40 150
####################################################
#expected output
A B C D Count_index
8 1 10 81 1
8 1 10 81 1
9 2 20 92 2
9 2 20 92 2
9 2 20 92 2
12 11 101 121 3
12 11 101 121 3
13 3 30 134 4
15 4 40 150 5
15 4 40 150 5
15 4 40 150 5
CodePudding user response:
You can do this by inverting .duplicated() (so the first occurrence of each row is True and repeats are False) and then taking a cumulative sum with .cumsum() to keep an ongoing count of the unique rows encountered so far.
df['count_index'] = (~df.duplicated(keep="first")).cumsum()
print(df)
A B C D count_index
0 8 1 10 81 1
1 8 1 10 81 1
2 9 2 20 92 2
3 9 2 20 92 2
4 9 2 20 92 2
5 12 11 101 121 3
6 12 11 101 121 3
7 13 3 30 134 4
8 15 4 40 150 5
9 15 4 40 150 5
10 15 4 40 150 5
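For completeness, pandas can also label the groups directly with groupby(...).ngroup(); with sort=False the labels follow the order of first appearance, so adding 1 gives the same 1-based count_index. A sketch using the same df as above:

```python
import pandas as pd

df = pd.DataFrame({'A': [8, 8, 9, 9, 9, 12, 12, 13, 15, 15, 15],
                   'B': [1, 1, 2, 2, 2, 11, 11, 3, 4, 4, 4],
                   'C': [10, 10, 20, 20, 20, 101, 101, 30, 40, 40, 40],
                   'D': [81, 81, 92, 92, 92, 121, 121, 134, 150, 150, 150]})

# ngroup() assigns a 0-based label per group; sort=False keeps the
# groups in order of first appearance, and +1 makes the labels 1-based.
df['count_index'] = df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup() + 1
print(df)
```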
CodePudding user response:
You can compare each row with the previous one, using either diff().ne(0) or df.ne(df.shift()), and take a cumulative sum over the rows where every column changed:
df.diff().ne(0).all(axis=1).cumsum()
or
df.ne(df.shift()).all(axis=1).cumsum()
Output:
0 1
1 1
2 2
3 2
4 2
5 3
6 3
7 4
8 5
9 5
10 5
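Note that both shift-based variants only detect a change between consecutive rows, so they agree with the .duplicated() approach when identical rows are adjacent, as in this input. A self-contained sketch of the second variant:

```python
import pandas as pd

df = pd.DataFrame({'A': [8, 8, 9, 9, 9, 12, 12, 13, 15, 15, 15],
                   'B': [1, 1, 2, 2, 2, 11, 11, 3, 4, 4, 4],
                   'C': [10, 10, 20, 20, 20, 101, 101, 30, 40, 40, 40],
                   'D': [81, 81, 92, 92, 92, 121, 121, 134, 150, 150, 150]})

# A row starts a new group when all columns differ from the row above
# (in this data every column changes together between groups). The first
# row compares against NaN, so it always counts as the start of a group.
df['count_index'] = df.ne(df.shift()).all(axis=1).cumsum()
print(df)
```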