I have following data in CSV:
Person_A,Person_B,date_time,duration
177542,923164,2021-08-01 00:00:00.000,10
966573,923134,2021-08-01 00:00:00.000,4
966568,923135,2021-08-01 00:00:00.000,12
971582,923189,2021-08-01 00:00:00.000,30
971582,923189,2021-08-01 00:10:00.000,30
971582,923189,2021-08-01 00:20:00.000,30
I'm trying to identify the overlapped rows, store them in a separate dataframe and remove the overlapped rows from original dataframe. In my scenario last three rows are overlapped between person A 971582 and person B 923189, I have to keep the first overlapped row and discard the remaining
Desired output:
Person_A,Person_B,date_time,duration
177542,923164,2021-08-01 00:00:00.000,10
966573,923134,2021-08-01 00:00:00.000,4
966568,923135,2021-08-01 00:00:00.000,12
971582,923189,2021-08-01 00:00:00.000,30
CodePudding user response:
Based on your explanation, I guess this should work. This doesn't consider "date_time" and "duration".
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.drop_duplicates(subset=['Person_A', 'Person_B'])
CodePudding user response:
You can use the pandas duplicated()
and where()
methods to achieve what you want.
To store the duplicated data in a new dataframe, you can do
dupe_df = df.where(df.duplicated(keep="first")).dropna()
The keep
argument allows you to specify whether you want to keep the first or last entry of duplicate rows.
To remove duplicated rows from the original dataframe, simply use
df.drop_duplicates(subset=["Person_A", "Person_B"], keep="first", inplace=True)