I have 2 dataframes as such
A) Larger timeframe time-series (15 rows * 2 cols)
value time_point
0 NaN 2020-11-23T00:00:00Z
1 352.111111 2020-11-24T00:00:00Z
2 352.588888 2020-11-25T00:00:00Z
3 352.866663 2020-11-26T00:00:00Z
4 352.583333 2020-11-27T00:00:00Z
5 352.420000 2020-12-01T00:00:00Z
6 354.473333 2020-12-02T00:00:00Z
7 356.766667 2020-12-03T00:00:00Z
8 358.540000 2020-12-04T00:00:00Z
9 359.133333 2020-12-07T00:00:00Z
10 360.946667 2020-12-08T00:00:00Z
11 360.636667 2020-12-09T00:00:00Z
12 359.086667 2020-12-10T00:00:00Z
...
...
B) Granular time-series (arbitrary rows * 2 cols)
value time_point
0 111 2020-11-23T00:01:00Z
1 111 2020-11-23T00:02:00Z
2 111 2020-11-23T00:03:00Z
3 111 2020-11-23T00:04:00Z
4 111 2020-11-23T00:05:00Z
5 111 2020-11-24T00:01:00Z
6 111 2020-11-24T00:02:00Z
7 111 2020-11-24T00:03:00Z
8 111 2020-11-24T00:04:00Z
9 111 2020-11-24T00:05:00Z
10 111 2020-11-25T00:01:00Z
11 111 2020-11-25T00:02:00Z
12 111 2020-11-25T00:03:00Z
13 111 2020-11-25T00:04:00Z
14 111 2020-11-25T00:05:00Z
...
...
120 111 2020-12-01T00:00:00Z
Final Result
value time_point -2_days_prior_value
0 111 2020-11-23T00:01:00Z NaN (larger timeframe has no value for 21st Nov, 2020)
1 111 2020-11-23T00:02:00Z NaN
2 111 2020-11-23T00:03:00Z NaN
3 111 2020-11-23T00:04:00Z NaN
4 111 2020-11-23T00:05:00Z NaN
5 111 2020-11-24T00:01:00Z NaN
6 111 2020-11-24T00:02:00Z NaN
7 111 2020-11-24T00:03:00Z NaN
8 111 2020-11-24T00:04:00Z NaN
9 111 2020-11-24T00:05:00Z NaN
10 111 2020-11-25T00:01:00Z NaN (23rd Nov in the larger timeframe has NaN. Hence the same)
11 111 2020-11-25T00:02:00Z NaN
12 111 2020-11-25T00:03:00Z NaN
13 111 2020-11-25T00:04:00Z NaN (just 4 entries on 25th Nov instead of 5)
14 111 2020-11-26T00:01:00Z 352.111111 (taken from previous larger timeframe 24th Nov, 2020)
...
...
120 111 2020-12-01T00:00:00Z 352.866663 (taken from larger timeframe 26th Nov, 2020)
What I normally do is
- shift the dataframe values by 2 rows (equivalent of -2)
- Get the relevant series out (iteration is faster over series rather than a dataframe)
- Create a new empty list to store the results
-2_days_prior_value
- Loop over the small timeframe (more rows) and look for the corresponding date in larger timeframe and if the date matches, append results to the new list.
These operations are getting very frequent. I'm aware of certain Pandas constructs but they are either inefficient or don't suit my use case.
Resampling
. Have used it in the past but I couldn't apply the same to my problem.- Replace using
loc
ornp.where
. I would have to call the operationn
number of times wheren = len(larger timeframe)
- I just can't duplicate/extrapolate evenly as the smaller timeframe can have irregular multiples.
Any inbuilt function in Pandas or a generic Pythonic solution which can help me compute the same result faster and/or using fewer lines of code or am I already on the right path?
CodePudding user response:
Can you instead
subtract 1 minute from 2nd dataframe
right-join this with first dataframe
add 2 days and 1 minute to the time column in the joined column
Does this achieve what you are trying to do?