Compare and loop every row of in between time in a second dataframe, if match copy values from colum

Time:11-03

I have two DFs:

  • DF1 has a column for time
MsgTime 
13:45:33
14:13:25
15:16:43
16:51:19
  • DF2 has two columns of time, and additional two columns of information
MsgTime1 | MsgTime2 | Temperature | Humidity
13:40:33   13:50:13    21           45
14:16:43   14:26:43    22           56
16:49:11   16:59:02    32           40

If MsgTime falls between MsgTime1 and MsgTime2 in DF2, I want to copy that row's Temperature and Humidity to DF1. Every row of DF1 should be compared against every row of DF2; if there is no match, the values should be left blank. The desired output looks like this:

MsgTime | Temperature | Humidity
13:45:33   21            45
14:13:25   NaN           NaN
15:16:43   NaN           NaN
16:51:19    32           40

I tried a double for loop, but it didn't work:

for i, row in DF1.iterrows():
    for j, row2 in DF2.iterrows():
        if (row2['MsgTime1'] <= row['MsgTime']) and (row['MsgTime'] <=  row2['MsgTime2']):
            row['Temperature'] = row2['Temperature']
            row['Humidity'] = row2['Humidity']
        else:
            row2  = 1

CodePudding user response:

One way using pandas.Interval with pandas.to_datetime:

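import pandas as pd

# Index df2 by the interval (MsgTime1, MsgTime2]; pd.Interval is closed on the right by default
df2.index = df2[["MsgTime1", "MsgTime2"]].apply(lambda x: pd.Interval(*pd.to_datetime(x)), axis=1)

# Map each MsgTime in df1 to the interval that contains it; unmatched times become NaN
s = pd.to_datetime(df1["MsgTime"])
for k in ["Temperature", "Humidity"]:
    df1[k] = s.map(df2[k])
print(df1)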

Output:

    MsgTime  Temperature  Humidity
0  13:45:33         21.0      45.0
1  14:13:25          NaN       NaN
2  15:16:43          NaN       NaN
3  16:51:19         32.0      40.0
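
Note that pd.Interval is closed on the right only by default, while the comparison in the question is inclusive on both ends. As a rough, self-contained sketch of the same idea (not necessarily how the answer above would be extended), the intervals can be built with pandas.IntervalIndex.from_arrays and closed="both", then searched with get_indexer. The names intervals, times, pos and result are illustrative, and the sketch assumes the intervals in df2 do not overlap:

```python
import pandas as pd

df1 = pd.DataFrame({"MsgTime": ["13:45:33", "14:13:25", "15:16:43", "16:51:19"]})
df2 = pd.DataFrame({
    "MsgTime1": ["13:40:33", "14:16:43", "16:49:11"],
    "MsgTime2": ["13:50:13", "14:26:43", "16:59:02"],
    "Temperature": [21, 22, 32],
    "Humidity": [45, 56, 40],
})

fmt = "%H:%M:%S"  # time-only strings; pandas fills in a dummy date

# One interval per row of df2, inclusive on both ends
intervals = pd.IntervalIndex.from_arrays(
    pd.to_datetime(df2["MsgTime1"], format=fmt),
    pd.to_datetime(df2["MsgTime2"], format=fmt),
    closed="both",
)

# Position of the containing interval for each MsgTime, or -1 if none
# (assumes the intervals do not overlap)
times = pd.to_datetime(df1["MsgTime"].tolist(), format=fmt)
pos = intervals.get_indexer(times)

result = df1.copy()
for col in ["Temperature", "Humidity"]:
    picked = pd.Series(df2[col].to_numpy()[pos], index=df1.index)
    # Blank out rows where no interval matched
    result[col] = picked.where(pos >= 0)

print(result)
```

Because unmatched rows are filled with NaN, the Temperature and Humidity columns come back as floats, as in the output above.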

CodePudding user response:

The piso package (pandas interval set operations) provides a very fast implementation for this. Setup:

import pandas as pd
import piso

df1 = pd.DataFrame(
    pd.to_datetime(["13:45:33", "14:13:25", "15:16:43", "16:51:19"]),
    columns=["MsgTime"],
)

df2 = pd.DataFrame(
    {
        "MsgTime1": pd.to_datetime(["13:40:33", "14:16:43", "16:49:11"]),
        "MsgTime2": pd.to_datetime(["13:50:13", "14:26:43", "16:59:02"]),
        "Temperature":[21,22,32],
        "Humidity":[45,56,40],
    }
)

Solution

Create a dataframe with the Temperature and Humidity columns, indexed by a pandas.IntervalIndex:

df3 = df2[["Temperature", "Humidity"]].set_index(pd.IntervalIndex.from_arrays(df2["MsgTime1"], df2["MsgTime2"]))

df3 looks like this

                                            Temperature  Humidity
(2021-11-03 13:40:33, 2021-11-03 13:50:13]           21        45
(2021-11-03 14:16:43, 2021-11-03 14:26:43]           22        56
(2021-11-03 16:49:11, 2021-11-03 16:59:02]           32        40

Note that since no date component was provided, pandas assumes today's date. Alternatively, you could use pandas.Timedelta instead of pandas.Timestamp; the approach works the same way.
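
For illustration, the Timedelta variant mentioned above might look like the following sketch. It uses only pandas (no piso), so the lookup step is done with IntervalIndex.get_indexer rather than piso.lookup. The names msg_times, starts, ends, readings and lookup are made up for this example, and the intervals are assumed not to overlap:

```python
import pandas as pd

# Times of day as offsets from midnight, so no date component is involved
msg_times = pd.to_timedelta(["13:45:33", "14:13:25", "15:16:43", "16:51:19"])
starts = pd.to_timedelta(["13:40:33", "14:16:43", "16:49:11"])
ends = pd.to_timedelta(["13:50:13", "14:26:43", "16:59:02"])

# Same shape as df3, but the intervals hold Timedelta endpoints
# (closed on the right, the pandas default)
readings = pd.DataFrame(
    {"Temperature": [21, 22, 32], "Humidity": [45, 56, 40]},
    index=pd.IntervalIndex.from_arrays(starts, ends),
)

# Row position of the interval containing each time, or -1 if none
pos = readings.index.get_indexer(msg_times)

# Reindexing by position turns the -1 entries into NaN rows
lookup = readings.reset_index(drop=True).reindex(index=pos)
lookup.index = msg_times

print(lookup)
```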

Next use piso.lookup

piso.lookup(df3, df1["MsgTime"])

which produces

                     Temperature  Humidity
2021-11-03 13:45:33         21.0      45.0
2021-11-03 14:13:25          NaN       NaN
2021-11-03 15:16:43          NaN       NaN
2021-11-03 16:51:19         32.0      40.0

Note however that piso does not yet handle intervals that are closed on both sides.
