I have two DFs:
- DF1 has a column for time
MsgTime
13:45:33
14:13:25
15:16:43
16:51:19
- DF2 has two columns of time, and additional two columns of information
MsgTime1 | MsgTime2 | Temperature | Humidity
13:40:33 13:50:13 21 45
14:16:43 14:26:43 22 56
16:49:11 16:59:02 32 40
So if MsgTime is between MsgTime1 and MsgTime2 from DF2, copy the Temperature and Humidity to DF1 as output. The loop should compare every row to find a match; if there is none, leave it blank. The desired output should look like this:
MsgTime | Temperature | Humidity
13:45:33 21 45
14:13:25 NaN NaN
15:16:43 NaN NaN
16:51:19 32 40
I tried to do like a double for loop, but it didn't seem to work:
for i, row in DF1.iterrows():
    for j, row2 in DF2.iterrows():
        if (row2['MsgTime1'] <= row['MsgTime']) and (row['MsgTime'] <= row2['MsgTime2']):
            row['Temperature'] = row2['Temperature']
            row['Humidity'] = row2['Humidity']
        else:
            row2 = 1
CodePudding user response:
One way using pandas.Interval with pandas.to_datetime:
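import pandas as pd
# df and df2 are DF1 and DF2 from the question.
# Index df2 by one Interval per row; pd.Interval is right-closed by default, i.e. (MsgTime1, MsgTime2].
df2.index = df2[["MsgTime1", "MsgTime2"]].apply(lambda x: pd.Interval(*pd.to_datetime(x)), axis=1)
s = pd.to_datetime(df["MsgTime"])
# Mapping through the interval-indexed columns gives NaN where no interval contains the time.
for k in ["Temperature", "Humidity"]:
    df[k] = s.map(df2[k])
print(df)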
Output:
MsgTime Temperature Humidity
0 13:45:33 21.0 45.0
1 14:13:25 NaN NaN
2 15:16:43 NaN NaN
3 16:51:19 32.0 40.0
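As a side note on why the loop in the question has no effect: iterrows() yields a copy of each row, so assigning to row never reaches DF1. A minimal sketch of a corrected loop, assuming the times are zero-padded HH:MM:SS strings (so plain string comparison is chronological) and rebuilding the sample frames from the question:

```python
import pandas as pd

# Sample data copied from the question, kept as plain strings.
DF1 = pd.DataFrame({"MsgTime": ["13:45:33", "14:13:25", "15:16:43", "16:51:19"]})
DF2 = pd.DataFrame({
    "MsgTime1": ["13:40:33", "14:16:43", "16:49:11"],
    "MsgTime2": ["13:50:13", "14:26:43", "16:59:02"],
    "Temperature": [21, 22, 32],
    "Humidity": [45, 56, 40],
})

# Start with empty columns so unmatched rows stay NaN.
DF1["Temperature"] = float("nan")
DF1["Humidity"] = float("nan")

for i, row in DF1.iterrows():
    for _, row2 in DF2.iterrows():
        # Zero-padded HH:MM:SS strings compare in chronological order.
        if row2["MsgTime1"] <= row["MsgTime"] <= row2["MsgTime2"]:
            # Write back through .loc; assigning to `row` only changes a copy.
            DF1.loc[i, ["Temperature", "Humidity"]] = [row2["Temperature"], row2["Humidity"]]
            break  # stop at the first matching interval

print(DF1)
```

This is still O(len(DF1) * len(DF2)), so it is only reasonable for small frames; the interval-based approach above scales better.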
CodePudding user response:
The piso package (pandas interval set operations) provides a very fast implementation for this.
import pandas as pd
import piso

df1 = pd.DataFrame(
    pd.to_datetime(["13:45:33", "14:13:25", "15:16:43", "16:51:19"]),
    columns=["MsgTime"],
)
df2 = pd.DataFrame(
    {
        "MsgTime1": pd.to_datetime(["13:40:33", "14:16:43", "16:49:11"]),
        "MsgTime2": pd.to_datetime(["13:50:13", "14:26:43", "16:59:02"]),
        "Temperature": [21, 22, 32],
        "Humidity": [45, 56, 40],
    }
)
solution
Create a dataframe with Temperature and Humidity columns, indexed by a pandas.IntervalIndex:
df3 = df2[["Temperature", "Humidity"]].set_index(pd.IntervalIndex.from_arrays(df2["MsgTime1"], df2["MsgTime2"]))
df3 looks like this:
Temperature Humidity
(2021-11-03 13:40:33, 2021-11-03 13:50:13] 21 45
(2021-11-03 14:16:43, 2021-11-03 14:26:43] 22 56
(2021-11-03 16:49:11, 2021-11-03 16:59:02] 32 40
Note that since no date component was provided, pandas assumes today's date. You could alternatively use pandas.Timedelta instead of pandas.Timestamp; the approach works the same way.
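To illustrate the Timedelta alternative, here is a small sketch that builds the same kind of interval index from time-of-day offsets instead of timestamps, so no date gets attached. It uses only pandas (pd.to_timedelta and pd.IntervalIndex.from_arrays); the name td_index is just for illustration:

```python
import pandas as pd

# Parse the HH:MM:SS strings as offsets from midnight rather than as dates.
start = pd.to_timedelta(["13:40:33", "14:16:43", "16:49:11"])
end = pd.to_timedelta(["13:50:13", "14:26:43", "16:59:02"])

# Right-closed by default, like the Timestamp version: (start, end].
td_index = pd.IntervalIndex.from_arrays(start, end)

# One boolean per interval: does it contain 13:45:33?
hits = td_index.contains(pd.Timedelta("13:45:33"))
print(hits)
```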
Next use piso.lookup
piso.lookup(df3, df1["MsgTime"])
which produces
Temperature Humidity
2021-11-03 13:45:33 21.0 45.0
2021-11-03 14:13:25 NaN NaN
2021-11-03 15:16:43 NaN NaN
2021-11-03 16:51:19 32.0 40.0
Note however that piso does not yet handle intervals that are closed on both sides.
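If both endpoints need to be inclusive, as in the question's <= comparisons, one possible workaround is plain pandas: build the IntervalIndex with closed="both" and look up positions with get_indexer. This is a sketch that swaps out piso rather than something piso provides. It assumes the intervals do not overlap (get_indexer raises an error otherwise), and it parses the times with an explicit format, which puts them all on 1900-01-01 instead of today's date:

```python
import numpy as np
import pandas as pd

fmt = "%H:%M:%S"  # explicit format; every value lands on 1900-01-01
df1 = pd.DataFrame(
    {"MsgTime": pd.to_datetime(["13:45:33", "14:13:25", "15:16:43", "16:51:19"], format=fmt)}
)
df2 = pd.DataFrame(
    {
        "MsgTime1": pd.to_datetime(["13:40:33", "14:16:43", "16:49:11"], format=fmt),
        "MsgTime2": pd.to_datetime(["13:50:13", "14:26:43", "16:59:02"], format=fmt),
        "Temperature": [21, 22, 32],
        "Humidity": [45, 56, 40],
    }
)

# Intervals that include both endpoints: [MsgTime1, MsgTime2].
intervals = pd.IntervalIndex.from_arrays(df2["MsgTime1"], df2["MsgTime2"], closed="both")

# Position of the containing interval for each MsgTime, or -1 if there is none.
pos = intervals.get_indexer(df1["MsgTime"])

# Take the matching rows and blank out the ones without a match.
values = df2[["Temperature", "Humidity"]].to_numpy(dtype=float)
matched = np.where(pos[:, None] >= 0, values[pos], np.nan)

result = df1.assign(Temperature=matched[:, 0], Humidity=matched[:, 1])
print(result)
```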