Ultimately I am trying to find when one person performs two actions at the same time. Let's say we have a df:
import pandas as pd

df = pd.DataFrame({
    'User': ['Carl', 'Carl', 'Steve', 'Carl', 'Steve', 'Steve'],
    'Action': ['Jump', 'Run', 'Jump', 'Swim', 'Run', 'Run'],
    'Time': ['12:01:01', '12:01:01', '12:04:03', '12:10:01', '12:17:31', '12:17:31']})
And that gives me a df that looks like this:
User Action Time
Carl Jump 12:01:01
Carl Run 12:01:01
Steve Jump 12:04:03
Carl Swim 12:10:01
Steve Run 12:17:31
Steve Run 12:17:31
I want to find cases where the User and Time are the same as in another row, but the Action is different. So in the above case I would want to return:
User Action Time
Carl Jump 12:01:01
Carl Run 12:01:01
I tried copying the df to df2 and doing:
import numpy as np

df['dupe'] = np.where((df.User == df2.User) & (df.Action != df2.Action) & (df.Time == df2.Time), 1, 0)
df = df[df.dupe == 1]
But it seems that is looking for a match in any row of the DF, and I am looking for two matches and a non-match within the same row.
I have been searching previous questions for over an hour now so I figured I would just go ahead and ask. Any advice would be great.
CodePudding user response:
We can merge the df with its copy (df2) on User and Time, then filter for rows where the actions differ, and dropna:
out = df.merge(df2, on=['User', 'Time'], how='left').query('Action_x != Action_y').dropna()
User Action_x Time Action_y
1 Carl Jump 12:01:01 Run
2 Carl Run 12:01:01 Jump
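As an optional extra step (not part of the merge itself), you could restore the original three-column layout by keeping each row's own action (Action_x) and renaming it:

# optional cleanup: keep each row's own action and rename the column back
result = out[['User', 'Action_x', 'Time']].rename(columns={'Action_x': 'Action'})
print(result)
#    User Action      Time
# 1  Carl   Jump  12:01:01
# 2  Carl    Run  12:01:01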
CodePudding user response:
You can also group by User and Time and filter for groups with multiple rows and no duplicated actions:
# keep only (User, Time) groups with 2+ rows whose actions are all distinct
subset = df.groupby(['User', 'Time']).filter(
    lambda g: len(g) > 1 and not g['Action'].duplicated(keep=False).any()
)
Output:
>>> subset
User Action Time
0 Carl Jump 12:01:01
1 Carl Run 12:01:01
CodePudding user response:
First keep the rows that are duplicated on "User" and "Time", and then drop the rows that are fully duplicated on "User", "Action", "Time".
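A minimal sketch of that approach, assuming drop_duplicates is called with keep=False so that both copies of a fully duplicated row (the two Steve/Run rows) are removed:

# Step 1: keep only rows whose (User, Time) pair occurs more than once
pairs = df[df.duplicated(subset=['User', 'Time'], keep=False)]

# Step 2: drop every row that is fully duplicated on (User, Action, Time);
# keep=False removes all copies, not just the later ones
out = pairs.drop_duplicates(subset=['User', 'Action', 'Time'], keep=False)

print(out)
#    User Action      Time
# 0  Carl   Jump  12:01:01
# 1  Carl    Run  12:01:01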