I have a pandas DataFrame that I want to group by column A, and I want to check that a certain value ('test') in column B does not appear more than once in each group.
Is there a pandas-native way to do the following:
1 - find the groups where 'test' appears in column B more than once?
2 - delete the additional occurrences (keeping the one with the minimum value in column C).
Example:
A B C
0 1 test 342
1 1 t 4556
2 1 te 222
3 1 test 56456
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
If I group by 'A', I find that 'test' appears twice where A == 1, which is the case I would like to deal with.
CodePudding user response:
To remove duplicated test values by columns A and B, keeping the first value per group:
# keep every non-'test' row, plus the first 'test' row in each A group
df = df[df.B.ne('test') | ~df.duplicated(['A','B'])]
print(df)
A B C
0 1 test 342
1 1 t 4556
2 1 te 222
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
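For the first part of the question (finding the groups where 'test' appears more than once), one option is to count the matches per group before filtering anything. This is a minimal sketch, assuming the sample data from the question; the names `counts` and `dup_groups` are only illustrative:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'B': ['test', 't', 'te', 'test', 't', 'te', 'test', 'test', 'tes'],
    'C': [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

# Count how many rows in each A group have B == 'test'
counts = df['B'].eq('test').groupby(df['A']).sum()

# Groups where 'test' occurs more than once
dup_groups = counts[counts > 1].index.tolist()
print(dup_groups)
```

For the sample data this prints [1], since only the A == 1 group contains 'test' twice.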
EDIT: If you need the minimal C among the test rows in B, and want to keep all rows tied for that minimal C, compare C against the result of GroupBy.transform after replacing C with NaN for the non-test rows via Series.mask:
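# m is True for rows that are not 'test'; those rows are always kept
m = df.B.ne('test')
# among 'test' rows, keep those whose C equals the minimal 'test' C of their A group
df = df[m | df.C.mask(m).groupby(df['A']).transform('min').eq(df['C'])]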
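To see why the expression above works, it can help to split it into its intermediate steps. This is a minimal sketch, assuming the sample data from the question; `test_c`, `group_min` and `out` are illustrative names:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'B': ['test', 't', 'te', 'test', 't', 'te', 'test', 'test', 'tes'],
    'C': [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

m = df['B'].ne('test')      # True for rows that are not 'test'
test_c = df['C'].mask(m)    # C for 'test' rows, NaN everywhere else

# Per-group minimum of the 'test' values, broadcast back to every row
# (NaN values are skipped when the minimum is computed)
group_min = test_c.groupby(df['A']).transform('min')

# Keep non-'test' rows, plus 'test' rows whose C equals the group minimum
out = df[m | group_min.eq(df['C'])]
print(out)
```

With this data, only the row with index 3 (test, 56456) is dropped, because 342 is the minimal 'test' value in the A == 1 group.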
But if you need to keep only the first of the duplicated test rows with the minimal C, use DataFrameGroupBy.idxmin on the filtered DataFrame:
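# m is True for rows that are not 'test'; those rows are always kept
m = df.B.ne('test')
# m1 is True only for the first minimal-C 'test' row of each A group
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())
df = df[m | m1]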
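The same steps can be run one at a time to inspect which index labels idxmin picks. This is a minimal sketch, assuming the sample data from the question; `keep_idx` and `out` are illustrative names:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'B': ['test', 't', 'te', 'test', 't', 'te', 'test', 'test', 'tes'],
    'C': [342, 4556, 222, 56456, 234525, 123, 23434, 777, 665],
})

m = df['B'].ne('test')  # True for rows that are not 'test'

# Index label of the minimal-C 'test' row in each A group
keep_idx = df[~m].groupby('A')['C'].idxmin()
print(keep_idx.tolist())

# Keep non-'test' rows, plus exactly one 'test' row per group
out = df[m | df.index.isin(keep_idx)]
print(out)
```

Because the final step matches on index labels, this relies on the index being unique. If labels repeat, calling `df.reset_index(drop=True)` first is one way to avoid keeping extra rows.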
Difference between the two solutions when several test rows share the minimal C:
print(df)
A B C
-2 1 test 342
-1 1 test 342
0 1 test 342
1 1 t 4556
2 1 te 222
3 1 test 56456
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
# transform + mask keeps every 'test' row tied for the minimal C
m = df.B.ne('test')
df1 = df[m | df.C.mask(m).groupby(df['A']).transform('min').eq(df['C'])]
print(df1)
A B C
-2 1 test 342
-1 1 test 342
0 1 test 342
1 1 t 4556
2 1 te 222
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
# idxmin keeps only the first 'test' row with the minimal C per group
m = df.B.ne('test')
m1 = df.index.isin(df[~m].groupby('A')['C'].idxmin())
df2 = df[m | m1]
print(df2)
A B C
-2 1 test 342
1 1 t 4556
2 1 te 222
4 2 t 234525
5 2 te 123
6 2 test 23434
7 3 test 777
8 3 tes 665
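To reproduce the comparison, the input frame with the tied rows can be rebuilt and both variants applied side by side. This is only a sketch: the helper names `keep_all_min` and `keep_first_min` are hypothetical and not part of pandas, and the frame is reconstructed from the printed output above.

```python
import pandas as pd

# Frame from the comparison: three 'test' rows in group 1 tie at C == 342
df = pd.DataFrame(
    {
        'A': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
        'B': ['test', 'test', 'test', 't', 'te', 'test',
              't', 'te', 'test', 'test', 'tes'],
        'C': [342, 342, 342, 4556, 222, 56456,
              234525, 123, 23434, 777, 665],
    },
    index=range(-2, 9),
)


def keep_all_min(df):
    """Keep every 'test' row tied for the minimal C in its A group."""
    m = df['B'].ne('test')
    group_min = df['C'].mask(m).groupby(df['A']).transform('min')
    return df[m | group_min.eq(df['C'])]


def keep_first_min(df):
    """Keep only the first minimal-C 'test' row in each A group."""
    m = df['B'].ne('test')
    keep_idx = df[~m].groupby('A')['C'].idxmin()
    return df[m | df.index.isin(keep_idx)]


df1 = keep_all_min(df)
df2 = keep_first_min(df)
print(df1.index.tolist())
print(df2.index.tolist())
```

The first variant drops only index 3, while the second also drops the tied rows -1 and 0, matching the two outputs shown above.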