Python filter row with multiple columns conditions-CodePudding

I have a CSV dataset and I need to filter it with conditions but the problem that the condition can be true for multiple days. What I want is to keep the last true value for this condition.

My dateset looks like this

Date           City             Summary                  No.
2-18-2019       NY            Airplane land              23
2-18-2019     London          Cargo handling              4
2-18-2019      Dubai          Airplane land              92
2-19-2019      Dubai          Airplane stay              92
2-19-2019      Paris          Flight canceled            78
2-19-2019       LA            Airplane Land              7
2-20-2019      Dubai          Airplane land              92
2-20-2019       LA            Airplane land              29
2-20-2019       NY            Airplane left              23
2-21-2019      Paris          Airplane reschedule        78
2-21-2019      London         Airplane land              4
2-21-2019       LA            Airplane from NY land      29
~~~
3-10-2019      London         Airplane land              5
3-10-2019      Paris          Airplane Land              78
3-10-2019       LA            Reschedule                 29
3-11-2019       NY            Cargo handled              23
3-11-2019      Dubai          Arrived be4 2 days         34
~~~
3-21-2019      Dubai          Airplane land              92
3-21-2019     New Delhi       Reschedule                 9
3-21-2019      London         Cargo handling             5
3-22-2019     New Delhi       Airplane Land              9
3-22-2019       NY            Reschedule                 23
3-22-2019      Dubai          Airplane land              92

So the code should give us last entry for plane landing where City == City and No. == No. and as you can see this condition can be true for multiple days. what I want is to check if condition is true for two days then keep the last.

The desired output should look like the dataset below:

Date           City             Summary                  No.
2-18-2019       NY            Airplane land              23
2-19-2019       LA            Airplane Land              7
2-20-2019      Dubai          Airplane land              92
2-21-2019      London         Airplane land              4
2-21-2019       LA            Airplane from NY land      29
~~~
3-10-2019      London         Airplane land              5
3-10-2019      Paris          Airplane Land              78
~~~
3-21-2019      Dubai          Airplane land              92
3-22-2019     New Delhi       Airplane Land              9
3-22-2019      Dubai          Airplane land              35

My code is below but it doesn't


import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta

df = pd.read_csv(r"C:\Airplanes.csv")

pd.set_option('display.max_columns', 500)
df = df.astype(str)



count = df.groupby(['City', 'No.'])['No.'].transform('size')



df['Date'] = pd.to_datetime(df['Date'])

df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]


def filter(grp):
    a = grp.Date   timedelta(days=2)
    return grp[~grp.Date.isin(a)]

df.groupby(['City']).apply(filter).reset_index(drop=True)


export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)

Please help to fix it

CodePudding user response：

I think you need:

#convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])

#filter case non sensitive
df=df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land', case=False))]

#mask for match if exist dates with subtract one day
m = df['Date'].isin(df['Date'] - pd.Timedelta(days=1))

#filter out duplicates if exist previous days
df = df[(m & ~df['Date'].duplicated()) | ~m]
print (df)
         Date       City                Summary  No.
0  2019-02-18         NY          Airplane land   23
5  2019-02-19         LA          Airplane Land    7
6  2019-02-20      Dubai          Airplane land   92
10 2019-02-21     London          Airplane land    4
11 2019-02-21         LA  Airplane from NY land   29
12 2019-03-10     London          Airplane land    5
13 2019-03-10      Paris          Airplane Land   78
17 2019-03-21      Dubai          Airplane land   92
20 2019-03-22  New Delhi          Airplane Land    9
22 2019-03-22      Dubai          Airplane land   92

CodePudding user response：

You can simple do it using drop_duplicates([col1, col2], keep='last'), as follows:

import pandas as pd

df = pd.read_csv('sample.csv')

df = df.drop_duplicates(['City', 'No.'], keep='last')

print(df)

# RESULT
         Date    City               Summary  No.
5   2-19-2019      LA       Airplane Land      7
10  2-21-2019  London       Airplane land      4
13  3-10-2019   Paris       Airplane Land     78
14  3-10-2019      LA         Reschedule      29
16  3-11-2019   Dubai    Arrived be4 2 days   34
19  3-21-2019  London      Cargo handling      5
20  3-22-2019     New  Delhi Airplane Land     9
21  3-22-2019      NY         Reschedule      23
22  3-22-2019   Dubai       Airplane land     92