Home > Software engineering >  How to count duplicates in column Pandas?
How to count duplicates in column Pandas?

Time:05-24

I use this rule to filter all rows where column num is unique. So I remove the duplicates:

df.drop_duplicates(subset=["num"], keep=False)

Also the same I do with column age:

df.drop_duplicates(subset=["age"], keep=False)

How to show result it another table with statistic of deleted elements like this:

Duplicates num(total)  Duplicates age (total)
1                      18

Thank you for answers, one remark is:

NUM AGE
1        18
2        18
3        18
4        20

As result I need to get:

NUM AGE
4        20

And extract duplicates (NUM value column) to list Python for further insertion to db duplicatesNums = [1,2,3]

CodePudding user response:

For new DataFrame call Series.duplicated per columns in DataFrame.apply, count Trues by sum and for one row DataFrame convert Series to DataFrame with transpose, also rename columns:

d = {'num': [1, 1, 1, 1, 3, 1, 2, 2],
      'age': [10, 10, 10, 11, 11, 98, 99, 102]}
df = pd.DataFrame(data=d)
print (df)
   num  age
0    1   10
1    1   10
2    1   10
3    1   11
4    3   11
5    1   98
6    2   99
7    2  102

f = lambda x: f'Duplicates {x} (total)'
df = (df[['num','age']].apply(lambda x: x.duplicated(keep=False))
                       .sum()
                       .rename(f)
                       .to_frame()
                       .T)
    
print (df)
   Duplicates num (total)  Duplicates age (total)
0                       7                       5

Alternative solution:

df = pd.DataFrame({f'Duplicates {x} (total)' : [df[x].duplicated(keep=False).sum()] 
                    for x in ['num','age']})
   
print (df)

   Duplicates num (total)  Duplicates age (total)
0                       7                       5

EDIT: For test non duplicated rows use:

df1 = df[~df['AGE'].duplicated(keep=False)]
print (df1)
   NUM  AGE
3    4   20

For get duplicated values by coumn NUM in list use:

duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()
print (duplicatesNums)
[1, 2, 3]

If NUM is index:

print (df)
     AGE
NUM     
1     18
2     18
3     18
4     20

duplicatesNums = df.index[df['AGE'].duplicated(keep=False)].tolist()
print (duplicatesNums)
[1, 2, 3]

CodePudding user response:

IIUC, you can use duplicated in apply, with sum to count the values:

df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())

dummy example:

df = pd.DataFrame({'num': list('AABCDD'), 'age': list('112345')})
df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())

output:

num    4
age    2
dtype: int64
extra question:
# identify duplicates
mask = df['AGE'].duplicated(keep=False)
# get indices
ids = mask[mask].index.to_list()
# [0, 1, 2]

# filter DataFrame
df2 = df[~mask]
#     NUM  AGE
#  3    4   20
  • Related