Insert Column into a Dataframe depending on dates from another Dataframe

Time:12-15

I have two dataframes:

DF1:

        Date           ID 
---------------------------------------   
12-03-2020 01:10:00    AAA   
22-04-2020 02:10:00    BBB
02-02-2020 13:55:00    CCC
15-11-2020 05:20:00    DDD
19-08-2020 14:30:00    EEE
23-04-2020 11:11:00    FFF
30-07-2020 21:40:00    GGG
11-10-2020 01:10:00    HHH
07-03-2020 01:10:00    III
06-01-2020 01:10:00    JJJ

DF2

    Start Date            End Date      ID 
---------------------------------------------   
12-03-2020 01:00:00 12-03-2020 02:00:00 AAA   
22-04-2020 02:00:00 22-04-2020 02:30:00 BBB
02-02-2020 13:00:00 02-02-2020 14:00:00 JJJ
15-11-2020 05:00:00 15-11-2020 05:10:00 DDD
30-07-2020 21:50:00 30-07-2020 21:55:00 EEE

What I need is to generate a column in DF1 called "Status" that says "Yes" or "No" depending on whether the date and time in df1 falls between the start date and end date in df2. In addition, the ID in df1 must match the ID in df2. For example, the result would be:

Date and Time          ID    Status
--------------------------------- 
12-03-2020 01:10:00   AAA     Yes
22-04-2020 02:10:00   BBB     Yes
02-02-2020 13:55:00   CCC     No
15-11-2020 05:20:00   DDD     No
19-08-2020 14:30:00   EEE     No
23-04-2020 11:11:00   FFF     No
30-07-2020 21:40:00   GGG     No
11-10-2020 01:10:00   HHH     No
07-03-2020 01:10:00   III     No
06-01-2020 01:10:00   JJJ     No

I have tried the following but I am missing the date condition and I don't know how to add it:

df1["status"] = df1.apply(lambda x: "Yes" if df2["ID"].isin(x).any() else "No", axis=1)

Any suggestions?

CodePudding user response:

The simplest way is to merge first, using a left join so that rows of df1 with no matching ID in df2 are kept:

df1 = df1.merge(right=df2, on='ID', how='left')

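Then create the status column (the date columns must already be of datetime type):

import numpy as np
# Each comparison needs its own parentheses because & binds more tightly than > and <
df1['status'] = np.where((df1['Date'] > df1['Start Date']) & (df1['Date'] < df1['End Date']), 'Yes', 'No')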

Thereafter drop the unneeded columns in df1:

df1 = df1.drop(columns=['Start Date', 'End Date'])
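
As a minimal end-to-end sketch of this merge-then-compare approach, assuming the date columns start out as strings in the day-month-year format shown in the question. The names fmt, merged, in_window and result are illustrative and not part of the original answer. It uses a left join so unmatched df1 rows survive, and rows without a match compare against NaT, which evaluates to False.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Date": ["12-03-2020 01:10:00", "22-04-2020 02:10:00", "02-02-2020 13:55:00",
             "15-11-2020 05:20:00", "19-08-2020 14:30:00", "23-04-2020 11:11:00",
             "30-07-2020 21:40:00", "11-10-2020 01:10:00", "07-03-2020 01:10:00",
             "06-01-2020 01:10:00"],
    "ID": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH", "III", "JJJ"],
})
df2 = pd.DataFrame({
    "Start Date": ["12-03-2020 01:00:00", "22-04-2020 02:00:00", "02-02-2020 13:00:00",
                   "15-11-2020 05:00:00", "30-07-2020 21:50:00"],
    "End Date": ["12-03-2020 02:00:00", "22-04-2020 02:30:00", "02-02-2020 14:00:00",
                 "15-11-2020 05:10:00", "30-07-2020 21:55:00"],
    "ID": ["AAA", "BBB", "JJJ", "DDD", "EEE"],
})

# Assumption: dates are day-first strings; parse them before comparing
fmt = "%d-%m-%Y %H:%M:%S"
df1["Date"] = pd.to_datetime(df1["Date"], format=fmt)
df2["Start Date"] = pd.to_datetime(df2["Start Date"], format=fmt)
df2["End Date"] = pd.to_datetime(df2["End Date"], format=fmt)

# Left join keeps every df1 row; unmatched rows get NaT in the df2 columns
merged = df1.merge(df2, on="ID", how="left")

# Strictly between start and end; comparisons against NaT are False
in_window = (merged["Date"] > merged["Start Date"]) & (merged["Date"] < merged["End Date"])
merged["status"] = np.where(in_window, "Yes", "No")

result = merged.drop(columns=["Start Date", "End Date"])
print(result)
```

This only gives one row per df1 row while ID is unique in df2; with repeated IDs the merge duplicates df1 rows.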

CodePudding user response:

If ID is not unique in df2, merging alone is not enough, because each df1 row is repeated once for every matching df2 row.

To demonstrate, I added two records with the same ID to df2.

df2 = pd.DataFrame([
  ['12-03-2020 01:00:00', '12-03-2020 02:00:00', 'AAA'],
  ['22-04-2020 02:00:00', '22-04-2020 02:30:00', 'BBB'],
  ['02-02-2020 13:00:00', '02-02-2020 14:00:00', 'JJJ'],
  ['15-11-2020 05:00:00', '15-11-2020 05:10:00', 'DDD'],
  ['30-07-2020 21:50:00', '30-07-2020 21:55:00', 'EEE'],
  ['06-03-2020 01:10:00', '06-03-2020 01:10:00', 'III'],  # added
  ['07-03-2020 01:10:00', '08-03-2020 01:10:00', 'III'],  # added
], columns=['Start Date', 'End Date', 'ID'])

First, merge df2 into df1 with a left join.

# Convert all date columns to datetime type first, if not done already
# ex: df1['Date'] = pd.to_datetime(df1.Date, dayfirst=True)

df = df1.merge(df2, on='ID', how='left')

#                Date  ID          Start Date            End Date
# 2020-03-12 01:10:00 AAA 2020-03-12 01:00:00 2020-03-12 02:00:00
# 2020-08-19 14:30:00 EEE 2020-07-30 21:50:00 2020-07-30 21:55:00
# ...
# 2020-10-11 01:10:00 HHH                 NaT                 NaT
# ...
# 2020-03-07 01:10:00 III 2020-03-06 01:10:00 2020-03-06 01:10:00
# 2020-03-07 01:10:00 III 2020-03-07 01:10:00 2020-03-08 01:10:00

# The df1 row for III now appears twice, once per matching III record in df2

Then aggregate: a row is True if its Date falls between Start Date and End Date for any of the matching df2 records.

# between() is inclusive on both ends by default; pass the inclusive parameter to change that.
df['Estatus'] = df.Date.between(df['Start Date'], df['End Date'])
df = df.groupby(['ID', 'Date']).Estatus.any().reset_index()

Result

     ID                Date Estatus
0   AAA 2020-03-12 01:10:00    True
1   BBB 2020-04-22 02:10:00    True
2   CCC 2020-02-02 13:55:00   False
3   DDD 2020-11-15 05:20:00   False
4   EEE 2020-08-19 14:30:00   False
5   FFF 2020-04-23 11:11:00   False
6   GGG 2020-07-30 21:40:00   False
7   HHH 2020-10-11 01:10:00   False
8   III 2020-03-07 01:10:00    True
9   JJJ 2020-01-06 01:10:00   False
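
If the goal is the "Yes"/"No" column on df1 itself, in df1's original row order, one possible variation is to group on df1's row label instead of on ID and Date. The following is a sketch under assumptions, not part of the original answer: it assumes day-first date strings, a unique df1 index, and uses a hypothetical helper column named row.

```python
import pandas as pd

fmt = "%d-%m-%Y %H:%M:%S"  # assumption: day-month-year strings as in the question

df1 = pd.DataFrame({
    "Date": ["12-03-2020 01:10:00", "22-04-2020 02:10:00", "02-02-2020 13:55:00",
             "15-11-2020 05:20:00", "19-08-2020 14:30:00", "23-04-2020 11:11:00",
             "30-07-2020 21:40:00", "11-10-2020 01:10:00", "07-03-2020 01:10:00",
             "06-01-2020 01:10:00"],
    "ID": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH", "III", "JJJ"],
})
df2 = pd.DataFrame([
    ["12-03-2020 01:00:00", "12-03-2020 02:00:00", "AAA"],
    ["22-04-2020 02:00:00", "22-04-2020 02:30:00", "BBB"],
    ["02-02-2020 13:00:00", "02-02-2020 14:00:00", "JJJ"],
    ["15-11-2020 05:00:00", "15-11-2020 05:10:00", "DDD"],
    ["30-07-2020 21:50:00", "30-07-2020 21:55:00", "EEE"],
    ["06-03-2020 01:10:00", "06-03-2020 01:10:00", "III"],  # duplicate ID
    ["07-03-2020 01:10:00", "08-03-2020 01:10:00", "III"],  # duplicate ID
], columns=["Start Date", "End Date", "ID"])

df1["Date"] = pd.to_datetime(df1["Date"], format=fmt)
for col in ["Start Date", "End Date"]:
    df2[col] = pd.to_datetime(df2[col], format=fmt)

# Carry df1's index through the merge as a column named "row" (hypothetical name)
merged = df1.rename_axis("row").reset_index().merge(df2, on="ID", how="left")

# Inclusive on both ends by default; rows with NaT bounds evaluate to False
hit = merged["Date"].between(merged["Start Date"], merged["End Date"])

# One boolean per original df1 row: True if any matching interval contains Date
any_hit = hit.groupby(merged["row"]).any()

# Assignment aligns on the index, so df1 keeps its original order
df1["Status"] = any_hit.map({True: "Yes", False: "No"})
print(df1)
```

Grouping on the row label rather than on ID and Date also keeps two df1 rows that happen to share the same ID and Date from being collapsed into one.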