Home > Software engineering >  remove rows from data frame depending on (date) order and Id
remove rows from data frame depending on (date) order and Id

Time:02-15

These are my haves:

haves = pd.DataFrame({'id': ['1', '2', '2'],
                   'date': ['2011-01-01', '2011-01-03', '2011-01-03'],
                   'some_value': [1, 1, 2], 
                   'flag': [0, 0, 1]})

print(haves)   
  id        date  some_value  flag
0  1  2011-01-01           1     0
1  2  2011-01-03           1     0
2  2  2011-01-03           2     1

I would like to remove rows with flag = 0, if there is a "duplicate" w.r.t. same date and id. Hope this makes sense? So using this small simple example, my wants are as follows:

wants = pd.DataFrame({'id': ['1', '2'],
                   'date': ['2011-01-01', '2011-01-03'],
                   'some_value': [1, 2]}) 

print(wants)
  id        date  some_value
0  1  2011-01-01           1
1  2  2011-01-03           2

Can this be achieved in pandas?

CodePudding user response:

Use DataFrame.duplicated chained by | for bitwise OR by second condition for test not equal by 0 by Series.ne, pass masks to boolean indexing for filtering:

df = haves[~haves.duplicated(['id','date'], keep=False) | haves['flag'].ne(0)]
print(df)
  id        date  some_value  flag
0  1  2011-01-01           1     0
2  2  2011-01-03           2     1

Or use DataFrame.pop for remove flag after testing:

df = haves[~haves.duplicated(['id','date'], keep=False) | haves.pop('flag').ne(0)]
print(df)
  id        date  some_value
0  1  2011-01-01           1
2  2  2011-01-03           2

Details:

print(~haves.duplicated(['id','date'], keep=False))
0     True
1    False
2    False
dtype: bool

print (haves.pop('flag').ne(0))
0    False
1    False
2     True
Name: flag, dtype: bool

print (~haves.duplicated(['id','date'], keep=False) | haves.pop('flag').ne(0))
0     True
1    False
2     True
dtype: bool

CodePudding user response:

You can use boolean indexing with multiple conditions.

The easiest is to keep the rows that don't match either of the conditions.

Use duplicated with keep=False to identify all duplicates, and ne for unequality to 0:

# is the row not a duplicate?
m1 = ~haves[['id', 'date']].duplicated(keep=False)
# is the row not having a 0 flag?
m2 = haves['flag'].ne(0)

# if either condition True then keep, else drop
haves[m1|m2] #.drop('flag', axis=1) # uncomment to drop "flag" column

output:

  id        date  some_value  flag
0  1  2011-01-01           1     0
2  2  2011-01-03           2     1
  • Related