I have two dataframes:
DF1:
Date ID
---------------------------------------
12-03-2020 01:10:00 AAA
22-04-2020 02:10:00 BBB
02-02-2020 13:55:00 CCC
15-11-2020 05:20:00 DDD
19-08-2020 14:30:00 EEE
23-04-2020 11:11:00 FFF
30-07-2020 21:40:00 GGG
11-10-2020 01:10:00 HHH
07-03-2020 01:10:00 III
06-01-2020 01:10:00 JJJ
DF2
Start Date End Date ID
---------------------------------------------
12-03-2020 01:00:00 12-03-2020 02:00:00 AAA
22-04-2020 02:00:00 22-04-2020 02:30:00 BBB
02-02-2020 13:00:00 02-02-2020 14:00:00 JJJ
15-11-2020 05:00:00 15-11-2020 05:10:00 DDD
30-07-2020 21:50:00 30-07-2020 21:55:00 EEE
What I need is to generate a column in DF1 called "Status" that says "Yes" or "No" depending on whether the date in DF1 falls between the start date and the end date in DF2, and additionally the ID in DF1 must match the ID in DF2. For example, the result would be:
Date ID Status
---------------------------------
12-03-2020 01:10:00 AAA Yes
22-04-2020 02:10:00 BBB Yes
02-02-2020 13:55:00 CCC No
15-11-2020 05:20:00 DDD No
19-08-2020 14:30:00 EEE Yes
23-04-2020 11:11:00 FFF No
30-07-2020 21:40:00 GGG No
11-10-2020 01:10:00 HHH No
07-03-2020 01:10:00 III No
06-01-2020 01:10:00 JJJ No
I have tried the following but I am missing the date condition and I don't know how to add it:
df1["status"] = df1.apply(lambda x: "si" if df2["ID"].isin(x).any() else "no",axis=1)
Any suggestions?
CodePudding user response:
The best way is to merge first (use a left join so IDs that have no match in df2 are kept):
df1 = df1.merge(right=df2, on='ID', how='left')
Then create status:
import numpy as np
df1['status'] = np.where((df1['Date'] > df1['Start Date']) & (df1['Date'] < df1['End Date']), 'Yes', 'No')
Thereafter drop the unneeded columns in df1:
df1 = df1.drop(columns=['Start Date', 'End Date'])
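Putting the three steps together, a minimal end-to-end sketch (assuming the date columns still need to be parsed from the dd-mm-yyyy strings shown above; skip the to_datetime calls if they are already datetime dtype):
import pandas as pd
import numpy as np
# Parse the dd-mm-yyyy hh:mm:ss strings into real datetimes
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Start Date'] = pd.to_datetime(df2['Start Date'], dayfirst=True)
df2['End Date'] = pd.to_datetime(df2['End Date'], dayfirst=True)
# Left join keeps IDs with no interval in df2; their Start/End Date become NaT
merged = df1.merge(df2, on='ID', how='left')
# Comparisons against NaT evaluate to False, so unmatched IDs end up as 'No'
in_range = (merged['Date'] > merged['Start Date']) & (merged['Date'] < merged['End Date'])
merged['Status'] = np.where(in_range, 'Yes', 'No')
df1 = merged.drop(columns=['Start Date', 'End Date'])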
CodePudding user response:
If ID is not unique, just merging is not enough.
To demonstrate, I added 2 records for df2.
import pandas as pd

df2 = pd.DataFrame([
['12-03-2020 01:00:00', '12-03-2020 02:00:00', 'AAA'],
['22-04-2020 02:00:00', '22-04-2020 02:30:00', 'BBB'],
['02-02-2020 13:00:00', '02-02-2020 14:00:00', 'JJJ'],
['15-11-2020 05:00:00', '15-11-2020 05:10:00', 'DDD'],
['30-07-2020 21:50:00', '30-07-2020 21:55:00', 'EEE'],
['06-03-2020 01:10:00', '06-03-2020 01:10:00', 'III'], # added
['07-03-2020 01:10:00', '08-03-2020 01:10:00', 'III'], # added
], columns=['Start Date', 'End Date', 'ID'])
First merge df2 with left join.
# Convert all datetime fields to datetime type if not yet
# ex: df1['Date'] = pd.to_datetime(df1.Date, dayfirst=True)
df = df1.merge(df2, on='ID', how='left')
# Date ID Start Date End Date
# 2020-03-12 01:10:00 AAA 2020-03-12 01:00:00 2020-03-12 02:00:00
# 2020-08-19 14:30:00 EEE 2020-07-30 21:50:00 2020-07-30 21:55:00
# ...
# 2020-10-11 01:10:00 HHH NaT NaT
# ...
# 2020-03-07 01:10:00 III 2020-03-06 01:10:00 2020-03-06 01:10:00
# 2020-03-07 01:10:00 III 2020-03-07 01:10:00 2020-03-08 01:10:00
# Note the multiple rows for III, produced by the duplicated III records in df2
Then aggregate to check whether any Date falls between Start Date and End Date.
# between() is inclusive on both ends by default; pass the inclusive parameter if you need different behaviour.
df['Estatus'] = df.Date.between(df['Start Date'], df['End Date'])
df = df.groupby(['ID', 'Date']).Estatus.any().reset_index()
Result
ID Date Estatus
0 AAA 2020-03-12 01:10:00 True
1 BBB 2020-04-22 02:10:00 True
2 CCC 2020-02-02 13:55:00 False
3 DDD 2020-11-15 05:20:00 False
4 EEE 2020-08-19 14:30:00 False
5 FFF 2020-04-23 11:11:00 False
6 GGG 2020-07-30 21:40:00 False
7 HHH 2020-10-11 01:10:00 False
8 III 2020-03-07 01:10:00 True
9 JJJ 2020-01-06 01:10:00 False
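If you want the literal "Yes"/"No" strings from the question instead of booleans, you could map the result and attach it back onto df1 (a small sketch building on the df produced above; the map and merge keys are just one way to do it):
df['Estatus'] = df['Estatus'].map({True: 'Yes', False: 'No'})
df1 = df1.merge(df, on=['ID', 'Date'], how='left')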