Home > Enterprise >  Finding rows in a pandas data frame where specific columns match and others do not
Finding rows in a pandas data frame where specific columns match and others do not

Time:12-17

Ultimately I am trying to find when one person performs two actions at the same time. Lets say we had a df:

df = DataFrame({
'User': ['Carl', 'Carl', 'Steve', 'Carl', 'Steve', 'Steve'],
'Action': ['Jump', 'Run', 'Jump', 'Swim', 'Run', 'Run'],
'Time': [12:01:01, 12:01:01, 12:04:03, 12:10:01, 12:17:31, 12:17:31 ]})

And that gives me a df that looks like this:

User    Action  Time
Carl    Jump    12:01:01
Carl    Run     12:01:01
Steve   Jump    12:04:03
Carl    Swim    12:10:01
Steve   Run     12:17:31
Steve   Run     12:17:31

I want to find cases where the User and Time are the same as another row, but the action is different. So in the about case I would want to return:

User    Action  Time
Carl    Jump    12:01:01
Carl    Run     12:01:01

I tried copying the df and doing:

df['dupe'] = np.where((df.user == df2.user) & (df.action!= df2.action) & (df.time == df2.time)) , 1, 0)
df = df[df.dupe == 1]

But it seems that is looking for a match in any row in the DF and I am looking for two matches and a non-match in the same row.

I have been searching previous questions for over an hour now so I figured I would just go ahead and ask. Any advice would be great.

CodePudding user response:

So we can do merge then dropna and filter diff

out = df.merge(df2,on = ['User','Time'],how='left').query('Action_x!=Action_y').dropna()
   User Action_x      Time Action_y
1  Carl     Jump  12:01:01      Run
2  Carl      Run  12:01:01     Jump

CodePudding user response:

You can also group by Name and Time and filter for groups with multiple rows and with no duplicates:

subset = df.groupby(['User', 'Time']).filter(lambda g: len(g) > 1 and not g['Action'].duplicated(keep=False).any())

Output:

>>> subset
   User Action      Time
0  Carl   Jump  12:01:01
1  Carl    Run  12:01:01

CodePudding user response:

first keep duplicate rows based on "User" and "Time", and then drop duplicate rows based on "User", "Action", "Time".

  • Related