These are my haves:
haves = pd.DataFrame({'id': ['1', '2', '2'],
'date': ['2011-01-01', '2011-01-03', '2011-01-03'],
'some_value': [1, 1, 2],
'flag': [0, 0, 1]})
print(haves)
id date some_value flag
0 1 2011-01-01 1 0
1 2 2011-01-03 1 0
2 2 2011-01-03 2 1
I would like to remove rows with flag = 0, if there is a "duplicate" w.r.t. same date and id. Hope this makes sense? So using this small simple example, my wants are as follows:
wants = pd.DataFrame({'id': ['1', '2'],
'date': ['2011-01-01', '2011-01-03'],
'some_value': [1, 2]})
print(wants)
id date some_value
0 1 2011-01-01 1
1 2 2011-01-03 2
Can this be achieved in pandas?
CodePudding user response:
Use DataFrame.duplicated
chained by |
for bitwise OR by second condition for test not equal by 0
by Series.ne
, pass masks to boolean indexing
for filtering:
df = haves[~haves.duplicated(['id','date'], keep=False) | haves['flag'].ne(0)]
print(df)
id date some_value flag
0 1 2011-01-01 1 0
2 2 2011-01-03 2 1
Or use DataFrame.pop
for remove flag
after testing:
df = haves[~haves.duplicated(['id','date'], keep=False) | haves.pop('flag').ne(0)]
print(df)
id date some_value
0 1 2011-01-01 1
2 2 2011-01-03 2
Details:
print(~haves.duplicated(['id','date'], keep=False))
0 True
1 False
2 False
dtype: bool
print (haves.pop('flag').ne(0))
0 False
1 False
2 True
Name: flag, dtype: bool
print (~haves.duplicated(['id','date'], keep=False) | haves.pop('flag').ne(0))
0 True
1 False
2 True
dtype: bool
CodePudding user response:
You can use boolean indexing with multiple conditions.
The easiest is to keep the rows that don't match either of the conditions.
Use duplicated
with keep=False
to identify all duplicates, and ne
for unequality to 0:
# is the row not a duplicate?
m1 = ~haves[['id', 'date']].duplicated(keep=False)
# is the row not having a 0 flag?
m2 = haves['flag'].ne(0)
# if either condition True then keep, else drop
haves[m1|m2] #.drop('flag', axis=1) # uncomment to drop "flag" column
output:
id date some_value flag
0 1 2011-01-01 1 0
2 2 2011-01-03 2 1