Identify and drop overlapped rows in Pandas-CodePudding

I have following data in CSV:

Person_A,Person_B,date_time,duration 
177542,923164,2021-08-01 00:00:00.000,10
966573,923134,2021-08-01 00:00:00.000,4
966568,923135,2021-08-01 00:00:00.000,12 
971582,923189,2021-08-01 00:00:00.000,30
971582,923189,2021-08-01 00:10:00.000,30 
971582,923189,2021-08-01 00:20:00.000,30

I'm trying to identify the overlapped rows, store them in a separate dataframe and remove the overlapped rows from original dataframe. In my scenario last three rows are overlapped between person A 971582 and person B 923189, I have to keep the first overlapped row and discard the remaining

Desired output:
Person_A,Person_B,date_time,duration 
177542,923164,2021-08-01 00:00:00.000,10
966573,923134,2021-08-01 00:00:00.000,4
966568,923135,2021-08-01 00:00:00.000,12
971582,923189,2021-08-01 00:00:00.000,30

CodePudding user response：

Based on your explanation, I guess this should work. This doesn't consider "date_time" and "duration".

import pandas as pd
df = pd.read_csv('data.csv') 
new_df = df.drop_duplicates(subset=['Person_A', 'Person_B'])

CodePudding user response：

You can use the pandas duplicated() and where() methods to achieve what you want.

To store the duplicated data in a new dataframe, you can do

dupe_df = df.where(df.duplicated(keep="first")).dropna()

The keep argument allows you to specify whether you want to keep the first or last entry of duplicate rows.

To remove duplicated rows from the original dataframe, simply use

df.drop_duplicates(subset=["Person_A", "Person_B"], keep="first", inplace=True)