Consider two dataframes below:
date,price
2022-07-23 02:00:00,22834.24
2022-07-23 03:00:00,22808.55
2022-07-23 04:00:00,22895.41
2022-07-23 05:00:00,22902.46
2022-07-23 06:00:00,22827.46
2022-07-23 19:00:00,22272.57
2022-07-23 20:00:00,22325.82
2022-07-23 21:00:00,22243.32
2022-07-23 22:00:00,22469.08
2022-07-23 23:00:00,22451.07
2022-07-24 00:00:00,22549.18
2022-07-24 01:00:00,22423.58
2022-07-24 02:00:00,22469.09
2022-07-24 04:00:00,22396.51
2022-07-24 05:00:00,22749.98
2022-07-24 06:00:00,22679.01
2022-07-24 07:00:00,22701.61
date,price,passed_bars
2022-07-23 02:00:00,22834.24,30.0
2022-07-23 19:00:00,22272.57,13.0
2022-07-24 04:00:00,22396.51,4.0
We can use the following snippet code to regenerate the dataframes:
import pandas as pd
li1 = [{'date': '2022-07-23 02:00:00', 'price': 22834.24}, {'date': '2022-07-23 03:00:00', 'price': 22808.55},
{'date': '2022-07-23 04:00:00', 'price': 22895.41}, {'date': '2022-07-23 05:00:00', 'price': 22902.46},
{'date': '2022-07-23 06:00:00', 'price': 22827.46}, {'date': '2022-07-23 19:00:00', 'price': 22272.57},
{'date': '2022-07-23 20:00:00', 'price': 22325.82}, {'date': '2022-07-23 21:00:00', 'price': 22243.32},
{'date': '2022-07-23 22:00:00', 'price': 22469.08}, {'date': '2022-07-23 23:00:00', 'price': 22451.07},
{'date': '2022-07-24 00:00:00', 'price': 22549.18}, {'date': '2022-07-24 01:00:00', 'price': 22423.58},
{'date': '2022-07-24 02:00:00', 'price': 22469.09}, {'date': '2022-07-24 04:00:00', 'price': 22396.51},
{'date': '2022-07-24 05:00:00', 'price': 22749.98}, {'date': '2022-07-24 06:00:00', 'price': 22679.01},
{'date': '2022-07-24 07:00:00', 'price': 22701.61}]
li2 = [{'date': '2022-07-23 02:00:00', 'price': 22834.24, 'passed_bars': 30.0},
{'date': '2022-07-23 19:00:00', 'price': 22272.57, 'passed_bars': 13.0},
{'date': '2022-07-24 04:00:00', 'price': 22396.51, 'passed_bars': 4.0}]
df1 = pd.DataFrame.from_records(li1)
df2 = pd.DataFrame.from_records(li2)
The objective is to add a new columns to the first dataframe df1
, where each value must be computed based on the logic below:
This new column is the time distance between current record in
df1
and the nearest record in thedf2
such thatdf1.date.iloc[i] >= nearest_to_current(df2.date)
.
Based on the logic above, the desired dataframe should look like this one:
date,price, passed_time
2022-07-23 02:00:00,22834.24, 0 hours
2022-07-23 03:00:00,22808.55, 1 hours
2022-07-23 04:00:00,22895.41, 2 hours
2022-07-23 05:00:00,22902.46, 3 hours
2022-07-23 06:00:00,22827.46, 4 hours
2022-07-23 19:00:00,22272.57, 0 hours
2022-07-23 20:00:00,22325.82, 1 hours
2022-07-23 21:00:00,22243.32, 2 hours
2022-07-23 22:00:00,22469.08, 3 hours
2022-07-23 23:00:00,22451.07, 4 hours
2022-07-24 00:00:00,22549.18, 5 hours
2022-07-24 01:00:00,22423.58, 6 hours
2022-07-24 02:00:00,22469.09, 7 hours
2022-07-24 04:00:00,22396.51, 0 hours
2022-07-24 05:00:00,22749.98, 1 hours
2022-07-24 06:00:00,22679.01, 2 hours
2022-07-24 07:00:00,22701.61, 3 hours
CodePudding user response:
Try pd.merge_asof
(the dataframes must be sorted!):
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])
df2["passed_time"] = df2["date"]
x = pd.merge_asof(df1, df2[["date", "passed_time"]], on="date")
x["passed_time"] = (x["date"] - x["passed_time"]) / pd.Timedelta("1 hour")
print(x)
Prints:
date price passed_time
0 2022-07-23 02:00:00 22834.24 0.0
1 2022-07-23 03:00:00 22808.55 1.0
2 2022-07-23 04:00:00 22895.41 2.0
3 2022-07-23 05:00:00 22902.46 3.0
4 2022-07-23 06:00:00 22827.46 4.0
5 2022-07-23 19:00:00 22272.57 0.0
6 2022-07-23 20:00:00 22325.82 1.0
7 2022-07-23 21:00:00 22243.32 2.0
8 2022-07-23 22:00:00 22469.08 3.0
9 2022-07-23 23:00:00 22451.07 4.0
10 2022-07-24 00:00:00 22549.18 5.0
11 2022-07-24 01:00:00 22423.58 6.0
12 2022-07-24 02:00:00 22469.09 7.0
13 2022-07-24 04:00:00 22396.51 0.0
14 2022-07-24 05:00:00 22749.98 1.0
15 2022-07-24 06:00:00 22679.01 2.0
16 2022-07-24 07:00:00 22701.61 3.0
CodePudding user response:
To add a new column in the dataframe df
, one could use the following:
df["passed_time"] = passed_time_values
Given that you calculated the passed times. As for calculating passed times, one suggestion is the following:
def get_passed_time(date_time, date_time_list):
time_deltas = [date_time - date_time_df2 for date_time_df2 in date_time_list]
# Convert time deltas to hours
time_deltas = [time_delta.days * 24 time_delta.seconds/3600 for time_delta in time_deltas]
positive_t_deltas = [t_delta for t_delta in time_deltas if t_delta > 0.]
passed_time = min(positive_t_deltas)
return passed_time
There may be other easier ways to do it also.
Sample code:
date_time_list = list(pd.to_datetime(df1["date"]))
df2_date_time_list = list(pd.to_datetime(df2["date"]))
passed_time_values = [get_passed_time(date_time, df2_date_time_list)
for date_time in date_time_list]
df1["passed_time"] = passed_time_values