have a CSV dataset and I need to filter it with conditions but the problem is that the condition can be true for multiple days. What I want is to keep the last true value for these conditions within 3 days.
My dataset looks like this
Date City Summary Flight No. Company
2-18-2019 NY Airplane land 23 Delta
2-18-2019 London Cargo handling 4 British
2-18-2019 Dubai Airplane land 92 Emirates
2-19-2019 Dubai Airplane stay 92 Emirates
2-19-2019 Paris Flight canceled 78 British
2-19-2019 LA Airplane Land 7 United
2-20-2019 Dubai Airplane land 92 Emirates
2-20-2019 LA Airplane land 29 Delta
2-20-2019 NY Airplane left 23 Delta
2-21-2019 Paris Airplane reschedule 78 British
2-21-2019 London Airplane land 4 British
2-21-2019 LA Airplane from NY land 29 Delta
~~~
3-10-2019 London Airplane land 5 KLM
3-10-2019 Paris Airplane Land 78 Air France
3-10-2019 LA Reschedule 29 United
3-11-2019 NY Cargo handled 23 Delta
3-11-2019 Dubai Arrived be4 2 days 34 Etihad
~~~
3-21-2019 Dubai Airplane land 92 Etihad
3-21-2019 New Delhi Reschedule 9 AirAsia
3-21-2019 London Cargo handling 5 Lufthansa
3-22-2019 New Delhi Airplane Land 9 AirAsia
3-22-2019 NY Reschedule 23 United
3-22-2019 Dubai Airplane land 35 Etihad
The code should check if Summary.str.contains('Airplane ') & df.Summary.str.contains('Land') and if City == City and Flight No. == Flight No and Company == Company then return last entire within three days . So if all conditions are true on the 18th and 20th the code should return the 20 only. but if it is true for the 18th and 21th it should keep both. Please note that not all columns with have the same data( not a duplicated rows)
The desired output should look like the dataset below:
Date City Summary Flight No. Company
2-18-2019 NY Airplane land 23 Delta
2-19-2019 LA Airplane Land 7 United
2-20-2019 Dubai Airplane land 92 Emirates
2-21-2019 London Airplane land 4 British
2-21-2019 LA Airplane from NY land 29 Delta
~~~
3-10-2019 London Airplane land 5 KLM
3-10-2019 Paris Airplane Land 78 Air France
~~~
3-21-2019 Dubai Airplane land 92 Etihad
3-22-2019 New Delhi Airplane Land 9 AirAsia
3-22-2019 Dubai Airplane land 35 Etihad
My code is below but it doesn't work
import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta
df = pd.read_csv(r"C:\Airplanes.csv")
pd.set_option('display.max_columns', 500)
df = df.astype(str)
count = df.groupby(['City', 'Flight No.'])['No.'].transform('size')
df['Date'] = pd.to_datetime(df['Date'])
df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]
def filter(grp):
a = grp.Date timedelta(days=2)
return grp[~grp.Date.isin(a)]
df = np.where((df['City'] == df['City']) & (df['Company'] == df['Company']) & (df['Flight No.'] == df['Flight No.']).apply(filter).reset_index(drop=True))
export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)
It return the below error
AttributeError: 'bool' object has no attribute 'Date'
Please help me find a what to apply all condition and keep the last True entrie within specific days.
CodePudding user response:
First, the condition you use inside np.where will always be True. And, it is unclear from the rest of the code provided what the columns 'Rig' and 'LinerSize'.
Your use of np.where returns a tuple (array([0, 1, 2], dtype=int64),)
and the conditions inside are always True since we'll always have df['Rig'] == df['Rig']
etc. A common use of np.where would be to specify in addition a couple of values: one in case of True on your condition and the other in the False case. Yet, this will return a Series, not the full data frame, to which you are trying to apply the filter function.
I suggest to use the filter like:
city_list = ['NY', 'LA'] # just an example
company_list = ['Delta', 'United']
flight_list = [23, 7, 92]
df_new = [(df['City'].isin(city_list)) &
(df['Company'].isin(company_list)) &
(df['Flight No'].isin(flight_list)]
That should help you get closer to what you want
CodePudding user response:
First, we filter the DataFrame using contains
on airplane
and land
, all in lower case, as you did :
df_clean = df[(df['Summary'].str.lower().str.contains('airplane')) & (df['Summary'].str.lower().str.contains('land'))]
df_clean = df_clean.reset_index(drop=True)
Then we manage the date diff using duplicated
and diff
on a datetime Date
column format and sort the values to get the last valid entry to get the expected result :
df_clean['date_dt'] = pd.to_datetime(df_clean['Date'], format="%m-%d-%Y")
c = ['City', 'Flight No.', 'Company']
def f(x):
return (x[c].duplicated() & x['date_dt'].diff().dt.days.lt(4)).sort_values(ascending=False)
df_clean = df_clean.sort_values(c)
res = df_clean[~df_clean.groupby(c).apply(f).values]
res.sort_values('Date')
Output :
Date City Summary Flight No. Company date_dt
0 2-18-2019 NY Airplane land 23 Delta 2019-02-18
2 2-19-2019 LA Airplane Land 7 United 2019-02-19
3 2-20-2019 Dubai Airplane land 92 Emirates 2019-02-20
6 2-21-2019 LA Airplane from NY land 29 Delta 2019-02-21
5 2-21-2019 London Airplane land 4 British 2019-02-21
7 3-10-2019 London Airplane land 5 KLM 2019-03-10
8 3-10-2019 Paris Airplane Land 78 Air France 2019-03-10
9 3-21-2019 Dubai Airplane land 92 Etihad 2019-03-21
10 3-22-2019 Dubai Airplane land 35 Etihad 2019-03-22
11 3-22-2019 New Delhi Airplane Land 9 AirAsia 2019-03-22