I have a CSV dataset and I need to filter it with conditions but the problem that the condition can be true for multiple days. What I want is to keep the last true value for this condition.
My dateset looks like this
Date City Summary No.
2-18-2019 NY Airplane land 23
2-18-2019 London Cargo handling 4
2-18-2019 Dubai Airplane land 92
2-19-2019 Dubai Airplane stay 92
2-19-2019 Paris Flight canceled 78
2-19-2019 LA Airplane Land 7
2-20-2019 Dubai Airplane land 92
2-20-2019 LA Airplane land 29
2-20-2019 NY Airplane left 23
2-21-2019 Paris Airplane reschedule 78
2-21-2019 London Airplane land 4
2-21-2019 LA Airplane from NY land 29
~~~
3-10-2019 London Airplane land 5
3-10-2019 Paris Airplane Land 78
3-10-2019 LA Reschedule 29
3-11-2019 NY Cargo handled 23
3-11-2019 Dubai Arrived be4 2 days 34
~~~
3-21-2019 Dubai Airplane land 92
3-21-2019 New Delhi Reschedule 9
3-21-2019 London Cargo handling 5
3-22-2019 New Delhi Airplane Land 9
3-22-2019 NY Reschedule 23
3-22-2019 Dubai Airplane land 92
So the code should give us last entry for plane landing where City == City and No. == No. and as you can see this condition can be true for multiple days. what I want is to check if condition is true for two days then keep the last.
The desired output should look like the dataset below:
Date City Summary No.
2-18-2019 NY Airplane land 23
2-19-2019 LA Airplane Land 7
2-20-2019 Dubai Airplane land 92
2-21-2019 London Airplane land 4
2-21-2019 LA Airplane from NY land 29
~~~
3-10-2019 London Airplane land 5
3-10-2019 Paris Airplane Land 78
~~~
3-21-2019 Dubai Airplane land 92
3-22-2019 New Delhi Airplane Land 9
3-22-2019 Dubai Airplane land 35
My code is below but it doesn't
import pandas as pd
import openpyxl
import numpy as np
import io
from datetime import timedelta
df = pd.read_csv(r"C:\Airplanes.csv")
pd.set_option('display.max_columns', 500)
df = df.astype(str)
count = df.groupby(['City', 'No.'])['No.'].transform('size')
df['Date'] = pd.to_datetime(df['Date'])
df = df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land'))]
def filter(grp):
a = grp.Date timedelta(days=2)
return grp[~grp.Date.isin(a)]
df.groupby(['City']).apply(filter).reset_index(drop=True)
export_excel = df.to_excel(r'C:\MS.xlsx', index=None, header=True)
Please help to fix it
CodePudding user response:
I think you need:
#convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
#filter case non sensitive
df=df[(df.Summary.str.contains('Airplane ') & df.Summary.str.contains('Land', case=False))]
#mask for match if exist dates with subtract one day
m = df['Date'].isin(df['Date'] - pd.Timedelta(days=1))
#filter out duplicates if exist previous days
df = df[(m & ~df['Date'].duplicated()) | ~m]
print (df)
Date City Summary No.
0 2019-02-18 NY Airplane land 23
5 2019-02-19 LA Airplane Land 7
6 2019-02-20 Dubai Airplane land 92
10 2019-02-21 London Airplane land 4
11 2019-02-21 LA Airplane from NY land 29
12 2019-03-10 London Airplane land 5
13 2019-03-10 Paris Airplane Land 78
17 2019-03-21 Dubai Airplane land 92
20 2019-03-22 New Delhi Airplane Land 9
22 2019-03-22 Dubai Airplane land 92
CodePudding user response:
You can simple do it using drop_duplicates([col1, col2], keep='last')
, as follows:
import pandas as pd
df = pd.read_csv('sample.csv')
df = df.drop_duplicates(['City', 'No.'], keep='last')
print(df)
# RESULT
Date City Summary No.
5 2-19-2019 LA Airplane Land 7
10 2-21-2019 London Airplane land 4
13 3-10-2019 Paris Airplane Land 78
14 3-10-2019 LA Reschedule 29
16 3-11-2019 Dubai Arrived be4 2 days 34
19 3-21-2019 London Cargo handling 5
20 3-22-2019 New Delhi Airplane Land 9
21 3-22-2019 NY Reschedule 23
22 3-22-2019 Dubai Airplane land 92