Home > Mobile >  Dataframe Mangling using Two Pointers. Pythonic and/or Vectorized solution possible?
Dataframe Mangling using Two Pointers. Pythonic and/or Vectorized solution possible?

Time:10-29

I have 2 dataframes as such

A) Larger timeframe time-series (15 rows * 2 cols)

             value            time_point
0          NaN  2020-11-23T00:00:00Z
1   352.111111  2020-11-24T00:00:00Z
2   352.588888  2020-11-25T00:00:00Z
3   352.866663  2020-11-26T00:00:00Z
4   352.583333  2020-11-27T00:00:00Z
5   352.420000  2020-12-01T00:00:00Z
6   354.473333  2020-12-02T00:00:00Z
7   356.766667  2020-12-03T00:00:00Z
8   358.540000  2020-12-04T00:00:00Z
9   359.133333  2020-12-07T00:00:00Z
10  360.946667  2020-12-08T00:00:00Z
11  360.636667  2020-12-09T00:00:00Z
12  359.086667  2020-12-10T00:00:00Z
...
...

B) Granular time-series (arbitrary rows * 2 cols)

       value            time_point
0   111   2020-11-23T00:01:00Z
1   111   2020-11-23T00:02:00Z
2   111   2020-11-23T00:03:00Z
3   111   2020-11-23T00:04:00Z
4   111   2020-11-23T00:05:00Z
5   111   2020-11-24T00:01:00Z
6   111   2020-11-24T00:02:00Z
7   111   2020-11-24T00:03:00Z
8   111   2020-11-24T00:04:00Z
9   111   2020-11-24T00:05:00Z
10  111   2020-11-25T00:01:00Z
11  111   2020-11-25T00:02:00Z
12  111   2020-11-25T00:03:00Z
13  111   2020-11-25T00:04:00Z
14  111   2020-11-25T00:05:00Z
...
...
120 111   2020-12-01T00:00:00Z

Final Result

       value            time_point      -2_days_prior_value
0   111   2020-11-23T00:01:00Z         NaN (larger timeframe has no value for 21st Nov, 2020)
1   111   2020-11-23T00:02:00Z         NaN
2   111   2020-11-23T00:03:00Z         NaN
3   111   2020-11-23T00:04:00Z         NaN
4   111   2020-11-23T00:05:00Z         NaN
5   111   2020-11-24T00:01:00Z         NaN
6   111   2020-11-24T00:02:00Z         NaN
7   111   2020-11-24T00:03:00Z         NaN
8   111   2020-11-24T00:04:00Z         NaN
9   111   2020-11-24T00:05:00Z         NaN
10  111   2020-11-25T00:01:00Z         NaN (23rd Nov in the larger timeframe has NaN. Hence the same)
11  111   2020-11-25T00:02:00Z         NaN
12  111   2020-11-25T00:03:00Z         NaN
13  111   2020-11-25T00:04:00Z         NaN (just 4 entries on 25th Nov instead of 5)
14  111   2020-11-26T00:01:00Z  352.111111 (taken from previous larger timeframe 24th Nov, 2020)

...
...
120 111   2020-12-01T00:00:00Z  352.866663 (taken from larger timeframe 26th Nov, 2020)

What I normally do is

  • shift the dataframe values by 2 rows (equivalent of -2)
  • Get the relevant series out (iteration is faster over series rather than a dataframe)
  • Create a new empty list to store the results -2_days_prior_value
  • Loop over the small timeframe (more rows) and look for the corresponding date in larger timeframe and if the date matches, append results to the new list.

These operations are getting very frequent. I'm aware of certain Pandas constructs but they are either inefficient or don't suit my use case.

  • Resampling. Have used it in the past but I couldn't apply the same to my problem.
  • Replace using loc or np.where. I would have to call the operation n number of times where n = len(larger timeframe)
  • I just can't duplicate/extrapolate evenly as the smaller timeframe can have irregular multiples.

Any inbuilt function in Pandas or a generic Pythonic solution which can help me compute the same result faster and/or using fewer lines of code or am I already on the right path?

CodePudding user response:

Can you instead

  1. subtract 1 minute from 2nd dataframe

  2. right-join this with first dataframe

  3. add 2 days and 1 minute to the time column in the joined column

Does this achieve what you are trying to do?

  • Related