Home > Mobile >  Python filter with multiple conditions for string and date columns
Python filter with multiple conditions for string and date columns

Time:12-20

have a CSV dataset and I need to filter it with conditions but the problem is that the condition can be true for multiple days. What I want is to keep the last true value for these conditions within 3 days.

My dataset looks like this

Date           City             Summary              Flight No.    Company
2-18-2019       NY            Airplane land              23         Delta 
2-18-2019     London          Cargo handling              4         British
2-18-2019      Dubai          Airplane land              92         Emirates
2-19-2019      Dubai          Airplane stay              92         Emirates
2-19-2019      Paris          Flight canceled            78         British
2-19-2019       LA            Airplane Land              7          United
2-20-2019      Dubai          Airplane land              92         Emirates
2-20-2019       LA            Airplane land              29         Delta
2-20-2019       NY            Airplane left              23         Delta
2-21-2019      Paris          Airplane reschedule        78         British
2-21-2019      London         Airplane land              4          British
2-21-2019       LA            Airplane from NY land      29         Delta
~~~
3-10-2019      London         Airplane land              5          KLM
3-10-2019      Paris          Airplane Land              78       Air France
3-10-2019       LA            Reschedule                 29         United
3-11-2019       NY            Cargo handled              23         Delta
3-11-2019      Dubai          Arrived be4 2 days         34         Etihad
~~~
3-21-2019      Dubai          Airplane land              92         Etihad
3-21-2019     New Delhi       Reschedule                 9          AirAsia
3-21-2019      London         Cargo handling             5         Lufthansa
3-22-2019     New Delhi       Airplane Land              9          AirAsia
3-22-2019       NY            Reschedule                 23         United
3-22-2019      Dubai          Airplane land              35         Etihad

The code should check if Summary.str.contains('Airplane ') & df.Summary.str.contains('Land') and if City == City and Flight No. == Flight No and Company == Company then return last entire within three days . So if all conditions are true on the 18th and 20th the code should return the 20 only. but if it is true for the 18th and 21th it should keep both. Please note that not all columns with have the same data( not a duplicated rows)

The desired output should look like the dataset below:


Date           City             Summary              Flight No.    Company
2-18-2019       NY            Airplane land              23         Delta 
2-19-2019       LA            Airplane Land              7          United
2-20-2019      Dubai          Airplane land              92         Emirates
2-21-2019      London         Airplane land              4          British
2-21-2019       LA            Airplane from NY land      29         Delta
~~~
3-10-2019      London         Airplane land              5          KLM
3-10-2019      Paris          Airplane Land              78       Air France
~~~
3-21-2019      Dubai          Airplane land              92         Etihad
3-22-2019     New Delhi       Airplane Land              9          AirAsia
3-22-2019      Dubai          Airplane land              35         Etihad

My code is below but it doesn't work

import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta

df = pd.read_csv(r"C:\Airplanes.csv")

pd.set_option('display.max_columns', 500)
df = df.astype(str)



count = df.groupby(['City', 'Flight No.'])['No.'].transform('size')



df['Date'] = pd.to_datetime(df['Date'])

df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]


def filter(grp):
    a = grp.Date   timedelta(days=2)
    return grp[~grp.Date.isin(a)]

df = np.where((df['City'] == df['City']) & (df['Company'] == df['Company']) & (df['Flight No.'] == df['Flight No.']).apply(filter).reset_index(drop=True))


export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)

It return the below error

AttributeError: 'bool' object has no attribute 'Date'

Please help me find a what to apply all condition and keep the last True entrie within specific days.

CodePudding user response:

First, the condition you use inside np.where will always be True. And, it is unclear from the rest of the code provided what the columns 'Rig' and 'LinerSize'. Your use of np.where returns a tuple (array([0, 1, 2], dtype=int64),) and the conditions inside are always True since we'll always have df['Rig'] == df['Rig'] etc. A common use of np.where would be to specify in addition a couple of values: one in case of True on your condition and the other in the False case. Yet, this will return a Series, not the full data frame, to which you are trying to apply the filter function. I suggest to use the filter like:

city_list = ['NY', 'LA'] # just an example
company_list = ['Delta', 'United']
flight_list = [23, 7, 92]
df_new = [(df['City'].isin(city_list)) &
          (df['Company'].isin(company_list)) &
          (df['Flight No'].isin(flight_list)]

That should help you get closer to what you want

CodePudding user response:

First, we filter the DataFrame using contains on airplane and land, all in lower case, as you did :

df_clean = df[(df['Summary'].str.lower().str.contains('airplane')) & (df['Summary'].str.lower().str.contains('land'))]
df_clean = df_clean.reset_index(drop=True)

Then we manage the date diff using duplicated and diff on a datetime Date column format and sort the values to get the last valid entry to get the expected result :

df_clean['date_dt'] = pd.to_datetime(df_clean['Date'], format="%m-%d-%Y")

c = ['City', 'Flight No.', 'Company']

def f(x):
    return (x[c].duplicated() & x['date_dt'].diff().dt.days.lt(4)).sort_values(ascending=False)

df_clean = df_clean.sort_values(c)
res = df_clean[~df_clean.groupby(c).apply(f).values]
res.sort_values('Date')

Output :

    Date        City        Summary                 Flight No.  Company     date_dt
0   2-18-2019   NY          Airplane land           23          Delta       2019-02-18
2   2-19-2019   LA          Airplane Land           7           United      2019-02-19
3   2-20-2019   Dubai       Airplane land           92          Emirates    2019-02-20
6   2-21-2019   LA          Airplane from NY land   29          Delta       2019-02-21
5   2-21-2019   London      Airplane land           4           British     2019-02-21
7   3-10-2019   London      Airplane land           5           KLM         2019-03-10
8   3-10-2019   Paris       Airplane Land           78          Air France  2019-03-10
9   3-21-2019   Dubai       Airplane land           92          Etihad      2019-03-21
10  3-22-2019   Dubai       Airplane land           35          Etihad      2019-03-22
11  3-22-2019   New Delhi   Airplane Land           9           AirAsia     2019-03-22
  • Related