Home > Blockchain >  Pandas - remove wrong entry
Pandas - remove wrong entry

Time:10-07

I have dataset which has list of transition details. There are wrong entry which has same or ongoing transition date that should not happened which need to remove.

df = pd.DataFrame({

"Name":["A","B","C","A"],
"Dep_station":["Delhi","Bangalore","Chennai","Delhi"],
"Arr_station":["Bangalore","Chennai","Delhi","Bangalore"],
"Dep_time":["02-01-2022 10:00:00", "02-02-2022 10:00:00", "02-03-2022 10:00:00", "03-01-2022 10:00:00"],
"Arr_time":["04-01-2022 22:00:00", "04-02-2022 19:00:00", "05-03-2022 18:00:00", "05-01-2022 22:00:00"]
})

df looks like this -

        Name    Dep_station Arr_station Dep_time    Arr_time
0   A   Delhi   Bangalore   02-01-2022 10:00:00 04-01-2022 22:00:00
1   B   Bangalore   Chennai 02-02-2022 10:00:00 04-02-2022 19:00:00
2   C   Chennai Delhi   02-03-2022 10:00:00 05-03-2022 18:00:00
3   A   Delhi   Bangalore   03-01-2022 10:00:00 05-01-2022 22:00:00

If you notice this A is already on travel in the date but in last row Same A is on another travel very next day. I need to remove this.

Expected output is

        Name    Dep_station Arr_station Dep_time    Arr_time
0   A   Delhi   Bangalore   02-01-2022 10:00:00 04-01-2022 22:00:00
1   B   Bangalore   Chennai 02-02-2022 10:00:00 04-02-2022 19:00:00
2   C   Chennai Delhi   02-03-2022 10:00:00 05-03-2022 18:00:00

All help and suggestion would be helpful. Thanks in Advance.

CodePudding user response:

Not positive if I understand the conditions correctly here, but....

If the criteria is that we want to remove rows where:

  • Name, Dep_Station, and Arr_station are all the same
  • Dep_time is on a date that is a single day after another entry (with matching above items)
  • Dep_time is at a time that is the same as the other entry

If that's our criteria, we can sort the table in the right order and then use df.shift() to compare against the previous row. We will need to make sure that our Dep_time column is a datetime so we can parse the dates and times from it

import datetime # this is needed to adjust dates

df = pd.DataFrame({
    "Name":["A","B","C","A"],
    "Dep_station":["Delhi","Bangalore","Chennai","Delhi"],
    "Arr_station":["Bangalore","Chennai","Delhi","Bangalore"],
    "Dep_time":["02-01-2022 10:00:00", "02-02-2022 10:00:00", "02-03-2022 10:00:00", "03-01-2022 10:00:00"],
    "Arr_time":["04-01-2022 22:00:00", "04-02-2022 19:00:00", "05-03-2022 18:00:00", "05-01-2022 22:00:00"]
})

df['Dep_time'] = pd.to_datetime(df['Dep_time']) # may want to do same from Arr_time
df = df.sort_values(['Name','Dep_station','Arr_station','Dep_time'])
indexes_of_dupes = df[
    (df['Dep_time'].dt.date == df['Dep_time'].shift().dt.date   datetime.timedelta(days=1)) 
    & (df['Dep_time'].dt.time == df['Dep_time'].shift().dt.time)].index
# The above has two comparisons within it: first is the comparison to on the row
# to see if the date of the row is one different (the   timedelta(days=1))
# and then to see if the time-of-day is the same
df = df.drop(indexes_of_dupes)

Hope this at least gets you on the right track!

CodePudding user response:

I think you can use subset parameter from pandas.drop_duplicates:

df = df.drop_duplicates(subset=['Name', 'Dep_station', 'Arr_station'])

Hope I get you right

  • Related