I have dataset which has list of transition details. There are wrong entry which has same or ongoing transition date that should not happened which need to remove.
df = pd.DataFrame({
"Name":["A","B","C","A"],
"Dep_station":["Delhi","Bangalore","Chennai","Delhi"],
"Arr_station":["Bangalore","Chennai","Delhi","Bangalore"],
"Dep_time":["02-01-2022 10:00:00", "02-02-2022 10:00:00", "02-03-2022 10:00:00", "03-01-2022 10:00:00"],
"Arr_time":["04-01-2022 22:00:00", "04-02-2022 19:00:00", "05-03-2022 18:00:00", "05-01-2022 22:00:00"]
})
df looks like this -
Name Dep_station Arr_station Dep_time Arr_time
0 A Delhi Bangalore 02-01-2022 10:00:00 04-01-2022 22:00:00
1 B Bangalore Chennai 02-02-2022 10:00:00 04-02-2022 19:00:00
2 C Chennai Delhi 02-03-2022 10:00:00 05-03-2022 18:00:00
3 A Delhi Bangalore 03-01-2022 10:00:00 05-01-2022 22:00:00
If you notice this A is already on travel in the date but in last row Same A is on another travel very next day. I need to remove this.
Expected output is
Name Dep_station Arr_station Dep_time Arr_time
0 A Delhi Bangalore 02-01-2022 10:00:00 04-01-2022 22:00:00
1 B Bangalore Chennai 02-02-2022 10:00:00 04-02-2022 19:00:00
2 C Chennai Delhi 02-03-2022 10:00:00 05-03-2022 18:00:00
All help and suggestion would be helpful. Thanks in Advance.
CodePudding user response:
Not positive if I understand the conditions correctly here, but....
If the criteria is that we want to remove rows where:
- Name, Dep_Station, and Arr_station are all the same
- Dep_time is on a date that is a single day after another entry (with matching above items)
- Dep_time is at a time that is the same as the other entry
If that's our criteria, we can sort the table in the right order and then use df.shift()
to compare against the previous row. We will need to make sure that our Dep_time column is a datetime so we can parse the dates and times from it
import datetime # this is needed to adjust dates
df = pd.DataFrame({
"Name":["A","B","C","A"],
"Dep_station":["Delhi","Bangalore","Chennai","Delhi"],
"Arr_station":["Bangalore","Chennai","Delhi","Bangalore"],
"Dep_time":["02-01-2022 10:00:00", "02-02-2022 10:00:00", "02-03-2022 10:00:00", "03-01-2022 10:00:00"],
"Arr_time":["04-01-2022 22:00:00", "04-02-2022 19:00:00", "05-03-2022 18:00:00", "05-01-2022 22:00:00"]
})
df['Dep_time'] = pd.to_datetime(df['Dep_time']) # may want to do same from Arr_time
df = df.sort_values(['Name','Dep_station','Arr_station','Dep_time'])
indexes_of_dupes = df[
(df['Dep_time'].dt.date == df['Dep_time'].shift().dt.date datetime.timedelta(days=1))
& (df['Dep_time'].dt.time == df['Dep_time'].shift().dt.time)].index
# The above has two comparisons within it: first is the comparison to on the row
# to see if the date of the row is one different (the timedelta(days=1))
# and then to see if the time-of-day is the same
df = df.drop(indexes_of_dupes)
Hope this at least gets you on the right track!
CodePudding user response:
I think you can use subset
parameter from pandas.drop_duplicates
:
df = df.drop_duplicates(subset=['Name', 'Dep_station', 'Arr_station'])
Hope I get you right