Home > front end >  Duplicate count for a column in Pandas dataframe
Duplicate count for a column in Pandas dataframe

Time:10-25

I have a dataframe df like below:

import pandas as pd

data = {'A': ['XYZ', 'XYZ', 'XYZ', 'XYZ', 'PQR', 'XYZ', 'XYZ', 'ABC', 'XYZ', 'ABC'], 'B': ['2022-02-16 14:00:31', '2022-02-16 16:11:26', '2022-02-16 17:31:26',
                                                                                           '2022-02-16 22:47:46', '2022-02-17 07:11:11', '2022-02-17 10:43:36', 
                                                                                           '2022-02-17 15:05:11', '2022-02-17 15:07:25', '2022-02-17 15:08:35', 
                                                                                           '2022-02-17 15:09:46'], 'C': [1,0,0,0,1,0,0,1,0,0]}
df = pd.DataFrame(data)
df['B'] = pd.to_datetime(df['B'])
df


     |   A   |          B           |     C      |  
      ------- ---------------------- ------------ 
     |  XYZ  |  2022-02-16 14:00:31 |     1      |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |
     |  PQR  |  2022-02-17 07:11:11 |     1      |
     |  XYZ  |  2022-02-17 10:43:36 |     0      |
     |  XYZ  |  2022-02-17 15:05:11 |     0      |
     |  ABC  |  2022-02-17 15:07:25 |     1      |
     |  XYZ  |  2022-02-17 15:08:35 |     0      |
     |  ABC  |  2022-02-17 15:09:46 |     0      |
      ------- ---------------------- ------------ 

What I want to achieve is that I want to count the number of duplicates for each of the occurences of XYZ, PQR, ABC such that I get an output like below.


Expected Output :

     |   A   |          B           |     C      |  Count   |   
      ------- ---------------------- ------------ ---------- 
     |  XYZ  |  2022-02-16 14:00:31 |     1      |    7     |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |          |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |          |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |          |
     |  PQR  |  2022-02-17 07:11:11 |     1      |    1     |
     |  XYZ  |  2022-02-17 10:43:36 |     0      |          |
     |  XYZ  |  2022-02-17 15:05:11 |     0      |          |
     |  ABC  |  2022-02-17 15:07:25 |     1      |    2     |
     |  XYZ  |  2022-02-17 15:08:35 |     0      |          |
     |  ABC  |  2022-02-17 15:09:45 |     0      |          |
      ------- ---------------------- ------------ ---------- 

Currently, I'm trying to achieve the same by using the code below but I'm unable to get expected/desired results.


one_index = df[df['C'] == 1].index
zero_index = df[df['C'] == 0].index

df.loc[0, 'Count'] = len(df)


     |   A   |          B           |     C      |  Count   |   
      ------- ---------------------- ------------ ---------- 
     |  XYZ  |  2022-02-16 14:00:31 |     1      |    10    |
     |  XYZ  |  2022-02-16 16:11:26 |     0      |          |
     |  XYZ  |  2022-02-16 17:31:26 |     0      |          |
     |  XYZ  |  2022-02-16 22:47:46 |     0      |          |
     |  PQR  |  2022-02-17 07:11:11 |     1      |          |
     |  XYZ  |  2022-02-17 10:43:36 |     0      |          |
     |  XYZ  |  2022-02-17 15:05:11 |     0      |          |
     |  ABC  |  2022-02-17 15:07:25 |     1      |          |
     |  XYZ  |  2022-02-17 15:08:35 |     0      |          |
     |  ABC  |  2022-02-17 15:09:45 |     0      |          |
      ------- ---------------------- ------------ ---------- 

So, how can I get count of duplicates for each of the values of column A as mentioned ?

EDIT (OPTIONAL):

I would also like to have ID assigned to the groups for the df after the count values has been assigned. So, my final dataframe should look like below after ID has been assigned:

    |   A   |          B           |     C      |  Count   |   ID  |
     ------- ---------------------- ------------ ---------- ------- 
    |  XYZ  |  2022-02-16 14:00:31 |     1      |    7     | ABC_1 |
    |  XYZ  |  2022-02-16 16:11:26 |     0      |          |       |
    |  XYZ  |  2022-02-16 17:31:26 |     0      |          |       |
    |  XYZ  |  2022-02-16 22:47:46 |     0      |          |       |
    |  PQR  |  2022-02-17 07:11:11 |     1      |    1     | ABC_2 |
    |  XYZ  |  2022-02-17 10:43:36 |     0      |          |       | 
    |  XYZ  |  2022-02-17 15:05:11 |     0      |          |       |
    |  ABC  |  2022-02-17 15:07:25 |     1      |    2     | ABC_3 |
    |  XYZ  |  2022-02-17 15:08:35 |     0      |          |       | 
    |  ABC  |  2022-02-17 15:09:45 |     0      |          |       |
     ------- ---------------------- ------------ ---------- ------- 

CodePudding user response:

Use GroupBy.transform with set empty strings for non 1 values in C by Series.where:

df['B'] = pd.to_datetime(df['B'])

m = df.C.eq(1)
df['Count'] = df.groupby('A')['C'].transform('size').where(m, '')
df.loc[m, 'ID'] = 'ABC_'   pd.RangeIndex(1, m.sum()   1).astype(str)
df['ID'] = df['ID'].fillna('')
print (df)
     A                   B  C Count     ID
0  XYZ 2022-02-16 14:00:31  1     7  ABC_1
1  XYZ 2022-02-16 16:11:26  0             
2  XYZ 2022-02-16 17:31:26  0             
3  XYZ 2022-02-16 22:47:46  0             
4  PQR 2022-02-17 07:11:11  1     1  ABC_2
5  XYZ 2022-02-17 10:43:36  0             
6  XYZ 2022-02-17 15:05:11  0             
7  ABC 2022-02-17 15:07:25  1     2  ABC_3
8  XYZ 2022-02-17 15:08:35  0             
9  ABC 2022-02-17 15:09:46  0         

Or:

df['B'] = pd.to_datetime(df['B'])

m = df.C.eq(1)
df['Count'] = df.groupby('A')['C'].transform('size').where(m, '')
df['ID'] = ('ABC_'   df['C'].cumsum().astype(str)).where(m, '')

print (df)
     A                   B  C Count     ID
0  XYZ 2022-02-16 14:00:31  1     7  ABC_1
1  XYZ 2022-02-16 16:11:26  0             
2  XYZ 2022-02-16 17:31:26  0             
3  XYZ 2022-02-16 22:47:46  0             
4  PQR 2022-02-17 07:11:11  1     1  ABC_2
5  XYZ 2022-02-17 10:43:36  0             
6  XYZ 2022-02-17 15:05:11  0             
7  ABC 2022-02-17 15:07:25  1     2  ABC_3
8  XYZ 2022-02-17 15:08:35  0             
9  ABC 2022-02-17 15:09:46  0                 

CodePudding user response:

Use value_counts and map with boolean indexing:

df['Count'] = df.loc[df['C'].eq(1), 'A'].map(df['A'].value_counts())

output:

     A                   B  C  Count
0  XYZ 2022-02-16 14:00:31  1    7.0
1  XYZ 2022-02-16 16:11:26  0    NaN
2  XYZ 2022-02-16 17:31:26  0    NaN
3  XYZ 2022-02-16 22:47:46  0    NaN
4  PQR 2022-02-17 07:11:11  1    1.0
5  XYZ 2022-02-17 10:43:36  0    NaN
6  XYZ 2022-02-17 15:05:11  0    NaN
7  ABC 2022-02-17 15:07:25  1    2.0
8  XYZ 2022-02-17 15:08:35  0    NaN
9  ABC 2022-02-17 15:09:46  0    NaN
  • Related