Home > Blockchain >  Count number of unique rows pandas
Count number of unique rows pandas

Time:08-30

I want to count the number of unique rows in a pandas dataframe and add a new row as count_index as in example. In another way, I want to duplicate the index for duplicate rows.

import pandas as pd
df = {'A': [ 8,8,9,9,9,12,12,13,15,15,15],
      'B': [ 1,1,2,2,2,11,11,3,4,4,4],
      'C': [ 10,10,20,20,20,101,101,30,40,40,40],
      'D': [81,81,92,92,92,121,121,134,150,150,150]}
df = pd.DataFrame(df)

print(df.groupby(['A','B','C','D']).size())
#####################################################
      #input
   A    B      C     D
   8    1      10    81 
   8    1      10    81 
   9    2      20    92 
   9    2      20    92 
   9    2      20    92 
  12   11     101   121 
  12   11     101   121 
  13    3      30   134 
  15    4      40   150 
  15    4      40   150 
  15    4      40   150 
 ####################################################
#expected output
    A    B      C     D   Count_index
   8    1      10    81    1
   8    1      10    81    1
   9    2      20    92    2
   9    2      20    92    2  
   9    2      20    92    2
  12   11     101   121    3
  12   11     101   121    3
  13    3      30   134    4
  15    4      40   150    5
  15    4      40   150    5
  15    4      40   150    5

CodePudding user response:

You can do this by counting the number of inverted .duplicateds. We can then use a cumulative to keep an ongoing count of the number of encountered unique rows.

df['count_index'] = (~df.duplicated(keep="first")).cumsum()

print(df)
     A   B    C    D  count_index
0    8   1   10   81            1
1    8   1   10   81            1
2    9   2   20   92            2
3    9   2   20   92            2
4    9   2   20   92            2
5   12  11  101  121            3
6   12  11  101  121            3
7   13   3   30  134            4
8   15   4   40  150            5
9   15   4   40  150            5
10  15   4   40  150            5


CodePudding user response:

You can use a combination of diff().ne(0) or df.ne(df.shift())

df.diff().ne(0).all(axis=1).cumsum()

or

df.ne(df.shift()).all(axis=1).cumsum()

Output:

0     1
1     1
2     2
3     2
4     2
5     3
6     3
7     4
8     5
9     5
10    5
  • Related