How do I use python to create new rows to fill in time gap based on a specified number of rows to be

I'm using Python to join Google Fit data to another data series that lists activities by minute. The code below reproduces an example of how the data is currently formatted.

import pandas as pd

Dffit = pd.DataFrame({"Time": ['2022-05-28 08:52:00', '2022-05-28 09:00:00', '2022-05-28 09:09:00'],
                      "fitnessActivity": ['running', 'biking', 'swimming'],
                      "minutes": [3, 5, 4]})
print(Dffit)

This produces starting data like this:

                   Time fitnessActivity minutes
0  2022-05-28 08:52:00         running        3
1  2022-05-28 09:00:00          biking        5
2  2022-05-28 09:09:00        swimming        4

I want to create new rows that increment the time column by 1 minute each, and duplicate the value in the fitnessActivity column. The minutes column specifies the number of rows needed.

I want my data to look like this:

               Time  fitnessActivity  minutes                                     
2022-05-28 08:52:00         running      3.0
2022-05-28 08:53:00         running      NaN
2022-05-28 08:54:00         running      NaN
2022-05-28 09:00:00          biking      5.0
2022-05-28 09:01:00          biking      NaN
2022-05-28 09:02:00          biking      NaN
2022-05-28 09:03:00          biking      NaN
2022-05-28 09:04:00          biking      NaN
2022-05-28 09:09:00        swimming      4.0
2022-05-28 09:10:00        swimming      NaN
2022-05-28 09:11:00        swimming      NaN
2022-05-28 09:12:00        swimming      NaN

I found several examples showing how to fill in missing time series, including this one, which I used as a model for writing my code. The problem is that it fills in rows based on the rows below in the dataset: any time gap is filled with the previous activity, when what I actually want is to preserve the gaps once the specified number of minutes has been added. I also want to add rows for the last activity. Currently none are added, since there are no later rows below it.

# Convert Time to a datetime object
Dffit['Time'] = pd.to_datetime(Dffit['Time'], format='%Y-%m-%d %H:%M:%S')
# Set Time column as index
Dffit.set_index(['Time'], inplace=True)
Dffit = Dffit.sort_index()
# Resample
out = Dffit[["fitnessActivity", "minutes"]].asfreq('min')
out["fitnessActivity"] = Dffit["fitnessActivity"].asfreq('min', method="ffill").asfreq('min')
print(out)

My current output looks like this:

               Time fitnessActivity  minutes
                           
2022-05-28 08:52:00         running      3.0
2022-05-28 08:53:00         running      NaN
2022-05-28 08:54:00         running      NaN
2022-05-28 08:55:00         running      NaN
2022-05-28 08:56:00         running      NaN
2022-05-28 08:57:00         running      NaN
2022-05-28 08:58:00         running      NaN
2022-05-28 08:59:00         running      NaN
2022-05-28 09:00:00          biking      5.0
2022-05-28 09:01:00          biking      NaN
2022-05-28 09:02:00          biking      NaN
2022-05-28 09:03:00          biking      NaN
2022-05-28 09:04:00          biking      NaN
2022-05-28 09:05:00          biking      NaN
2022-05-28 09:06:00          biking      NaN
2022-05-28 09:07:00          biking      NaN
2022-05-28 09:08:00          biking      NaN
2022-05-28 09:09:00        swimming      4.0

CodePudding user response:

import pandas as pd

Dffit = pd.DataFrame({"Time": ['2022-05-28 08:52:00', '2022-05-28 09:00:00', '2022-05-28 09:09:00'],
                      "fitnessActivity": ['running', 'biking', 'swimming'],
                      "minutes": [3, 5, 4]})
# The sample timestamps have no fractional seconds, so the format omits %f
Dffit['Time'] = pd.to_datetime(Dffit['Time'], format='%Y-%m-%d %H:%M:%S')

Dffit.set_index(['Time'], inplace=True)

# One DatetimeIndex per row: start at the row's time and take "minutes" one-minute steps
aaa = [pd.date_range(i, periods=Dffit.loc[i, "minutes"], freq='min') for i in Dffit.index]
# Combine the three ranges into a single index
aaa = aaa[0].union(aaa[1]).union(aaa[2])

# Reindex on the new timestamps, then carry each activity name forward
Dffit = Dffit.reindex(aaa)
Dffit['fitnessActivity'] = Dffit['fitnessActivity'].ffill()
print(Dffit)

Output

                    fitnessActivity  minutes
2022-05-28 08:52:00         running      3.0
2022-05-28 08:53:00         running      NaN
2022-05-28 08:54:00         running      NaN
2022-05-28 09:00:00          biking      5.0
2022-05-28 09:01:00          biking      NaN
2022-05-28 09:02:00          biking      NaN
2022-05-28 09:03:00          biking      NaN
2022-05-28 09:04:00          biking      NaN
2022-05-28 09:09:00        swimming      4.0
2022-05-28 09:10:00        swimming      NaN
2022-05-28 09:11:00        swimming      NaN
2022-05-28 09:12:00        swimming      NaN

The list comprehension that builds 'aaa' creates the timestamps needed for each row, and these are then combined into a single index. The DataFrame is reindexed on that new index, and the empty values in the 'fitnessActivity' column are filled with the previous value.
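
To make the first step concrete, here is a minimal sketch of what one element of 'aaa' looks like for the running row alone. The name first_block is introduced only for this illustration, and the start time and count are taken from the example data.

```python
import pandas as pd

# One activity that starts at 08:52 and lasts 3 minutes
# (values taken from the first row of the example data).
first_block = pd.date_range("2022-05-28 08:52:00", periods=3, freq="min")

# Show only the time part of each generated timestamp.
print(list(first_block.strftime("%H:%M")))  # ['08:52', '08:53', '08:54']
```

Because periods counts the starting timestamp too, a row with minutes equal to 3 yields exactly three timestamps, not four.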

If there are more than three rows, replace the line where the index lists are combined,

aaa = aaa[0].union(aaa[1]).union(aaa[2])

with the following line, which also requires import numpy as np:

aaa = pd.DatetimeIndex(np.array(np.hstack(aaa)))
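
As a possible alternative that does not depend on the number of rows at all, the sketch below repeats each row "minutes" times and then shifts the start time by the row's position within its activity. This is not the approach used in the answer above, just one way it could be done; the names expanded and offset are made up for the illustration.

```python
import pandas as pd

Dffit = pd.DataFrame({"Time": ["2022-05-28 08:52:00", "2022-05-28 09:00:00", "2022-05-28 09:09:00"],
                      "fitnessActivity": ["running", "biking", "swimming"],
                      "minutes": [3, 5, 4]})
Dffit["Time"] = pd.to_datetime(Dffit["Time"])

# Repeat each row as many times as its "minutes" value says.
expanded = Dffit.loc[Dffit.index.repeat(Dffit["minutes"])].copy()

# Position of each repeated row within its original row: 0, 1, 2, ...
offset = expanded.groupby(level=0).cumcount()

# Shift the start time forward by that many minutes.
expanded["Time"] = expanded["Time"] + pd.to_timedelta(offset, unit="min")

# Keep "minutes" only on the first row of each activity, as in the desired output.
expanded["minutes"] = expanded["minutes"].where(offset == 0)

expanded = expanded.set_index("Time")
print(expanded)
```

Since every activity is expanded independently, gaps between activities are left alone and the last activity gets its rows as well. If two activities overlap in time, this version keeps both sets of rows, so the resulting index may contain duplicate timestamps.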