I use this rule to filter all rows where column num is unique. So I remove the duplicates:
df.drop_duplicates(subset=["num"], keep=False)
Also the same I do with column age:
df.drop_duplicates(subset=["age"], keep=False)
How to show result it another table with statistic of deleted elements like this:
Duplicates num(total) Duplicates age (total)
1 18
Thank you for answers, one remark is:
NUM AGE
1 18
2 18
3 18
4 20
As result I need to get:
NUM AGE
4 20
And extract duplicates (NUM value column) to list Python for further insertion to db
duplicatesNums = [1,2,3]
CodePudding user response:
For new DataFrame call Series.duplicated
per columns in DataFrame.apply
, count True
s by sum
and for one row DataFrame convert Series
to DataFrame
with transpose, also rename
columns:
d = {'num': [1, 1, 1, 1, 3, 1, 2, 2],
'age': [10, 10, 10, 11, 11, 98, 99, 102]}
df = pd.DataFrame(data=d)
print (df)
num age
0 1 10
1 1 10
2 1 10
3 1 11
4 3 11
5 1 98
6 2 99
7 2 102
f = lambda x: f'Duplicates {x} (total)'
df = (df[['num','age']].apply(lambda x: x.duplicated(keep=False))
.sum()
.rename(f)
.to_frame()
.T)
print (df)
Duplicates num (total) Duplicates age (total)
0 7 5
Alternative solution:
df = pd.DataFrame({f'Duplicates {x} (total)' : [df[x].duplicated(keep=False).sum()]
for x in ['num','age']})
print (df)
Duplicates num (total) Duplicates age (total)
0 7 5
EDIT: For test non duplicated rows use:
df1 = df[~df['AGE'].duplicated(keep=False)]
print (df1)
NUM AGE
3 4 20
For get duplicated values by coumn NUM
in list use:
duplicatesNums = df.loc[df['AGE'].duplicated(keep=False), 'NUM'].tolist()
print (duplicatesNums)
[1, 2, 3]
If NUM
is index:
print (df)
AGE
NUM
1 18
2 18
3 18
4 20
duplicatesNums = df.index[df['AGE'].duplicated(keep=False)].tolist()
print (duplicatesNums)
[1, 2, 3]
CodePudding user response:
IIUC, you can use duplicated
in apply
, with sum
to count the values:
df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())
dummy example:
df = pd.DataFrame({'num': list('AABCDD'), 'age': list('112345')})
df[['num', 'age']].apply(lambda c: c.duplicated(keep=False).sum())
output:
num 4
age 2
dtype: int64
extra question:
# identify duplicates
mask = df['AGE'].duplicated(keep=False)
# get indices
ids = mask[mask].index.to_list()
# [0, 1, 2]
# filter DataFrame
df2 = df[~mask]
# NUM AGE
# 3 4 20