I'm using Python to join Google Fit data to another data series which lists activities by minute. Below is code to duplicate an example of how the data is currently formatted.
import pandas as pd
Dffit = pd.DataFrame({"Time": ['2022-05-28 08:52:00','2022-05-28 09:00:00','2022-05-28 09:09:00'],
"fitnessActivity": ['running','biking','swimming'],
"minutes": [3,5,4]})
print(Dffit)
This produces starting data like this:
Time fitnessActivity minutes
0 2022-05-28 08:52:00 running 3
1 2022-05-28 09:00:00 biking 5
2 2022-05-28 09:09:00 swimming 4
I want to create new rows that increment the time column by 1 minute each, and duplicate the value in the fitnessActivity column. The minutes column specifies the number of rows needed.
I want my data to look like this:
Time fitnessActivity minutes
2022-05-28 08:52:00 running 3.0
2022-05-28 08:53:00 running NaN
2022-05-28 08:54:00 running NaN
2022-05-28 09:00:00 biking 5.0
2022-05-28 09:01:00 biking NaN
2022-05-28 09:02:00 biking NaN
2022-05-28 09:03:00 biking NaN
2022-05-28 09:04:00 biking NaN
2022-05-28 09:09:00 swimming 4.0
2022-05-28 09:10:00 swimming NaN
2022-05-28 09:11:00 swimming NaN
2022-05-28 09:12:00 swimming NaN
I found several examples showing how to fill in missing time series, including this one, which I used as a model for my code. The problem is that it fills every gap up to the next row in the dataset, so any time gap is filled with the previous activity. What I actually want is to add only the specified number of minutes and leave the remaining gap empty. I also want rows added for the last activity; currently none are added, since there are no later rows to fill toward.
# Convert Time to a datetime object
Dffit['Time'] = pd.to_datetime(Dffit['Time'], format='%Y-%m-%d %H:%M:%S')
# Set Time column as index
Dffit.set_index(['Time'], inplace=True)
Dffit = Dffit.sort_index()
# Resample
out = Dffit[["fitnessActivity", "minutes"]].asfreq('min')
out["fitnessActivity"] = Dffit["fitnessActivity"].asfreq('min', method="ffill")
print(out)
My current output looks like this:
Time fitnessActivity minutes
2022-05-28 08:52:00 running 3.0
2022-05-28 08:53:00 running NaN
2022-05-28 08:54:00 running NaN
2022-05-28 08:55:00 running NaN
2022-05-28 08:56:00 running NaN
2022-05-28 08:57:00 running NaN
2022-05-28 08:58:00 running NaN
2022-05-28 08:59:00 running NaN
2022-05-28 09:00:00 biking 5.0
2022-05-28 09:01:00 biking NaN
2022-05-28 09:02:00 biking NaN
2022-05-28 09:03:00 biking NaN
2022-05-28 09:04:00 biking NaN
2022-05-28 09:05:00 biking NaN
2022-05-28 09:06:00 biking NaN
2022-05-28 09:07:00 biking NaN
2022-05-28 09:08:00 biking NaN
2022-05-28 09:09:00 swimming 4.0
Answer:
import pandas as pd
Dffit = pd.DataFrame({"Time": ['2022-05-28 08:52:00', '2022-05-28 09:00:00', '2022-05-28 09:09:00'],
"fitnessActivity": ['running', 'biking', 'swimming'],
"minutes": [3, 5, 4]})
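# Parse the strings as datetimes (they have no fractional seconds)
Dffit['Time'] = pd.to_datetime(Dffit['Time'], format='%Y-%m-%d %H:%M:%S')
Dffit.set_index(['Time'], inplace=True)
# One range of consecutive minutes per activity, 'minutes' entries long
aaa = [pd.date_range(i, periods=Dffit.loc[i, "minutes"], freq='min') for i in Dffit.index]
# Combine the three ranges into a single index
aaa = aaa[0].union(aaa[1]).union(aaa[2])
# Add the new timestamps as rows; values in the new rows are NaN
Dffit = Dffit.reindex(aaa)
# Carry each activity name forward into its new rows
Dffit['fitnessActivity'] = Dffit['fitnessActivity'].ffill()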
print(Dffit)
Output
fitnessActivity minutes
2022-05-28 08:52:00 running 3.0
2022-05-28 08:53:00 running NaN
2022-05-28 08:54:00 running NaN
2022-05-28 09:00:00 biking 5.0
2022-05-28 09:01:00 biking NaN
2022-05-28 09:02:00 biking NaN
2022-05-28 09:03:00 biking NaN
2022-05-28 09:04:00 biking NaN
2022-05-28 09:09:00 swimming 4.0
2022-05-28 09:10:00 swimming NaN
2022-05-28 09:11:00 swimming NaN
2022-05-28 09:12:00 swimming NaN
The list comprehension that builds 'aaa' creates one range of timestamps per row, and these ranges are then combined into a single index. The DataFrame is reindexed on that combined index, and the empty values in the 'fitnessActivity' column are filled with the previous value.
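To see what a single element of 'aaa' holds, here is a minimal sketch for the first row alone. The name 'idx' is only for illustration and does not appear in the code above.

```python
import pandas as pd

# Start at the row's timestamp and produce as many one-minute steps
# as the row's 'minutes' value (3 for the running row).
idx = pd.date_range("2022-05-28 08:52:00", periods=3, freq="min")
print(list(idx.strftime("%H:%M")))  # ['08:52', '08:53', '08:54']
```

Because the range is bounded by periods rather than by the next row's timestamp, the gap from 08:55 to 08:59 stays empty, and the last activity gets its rows too.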
If there are more than three rows, replace the line where the ranges are combined:
aaa = aaa[0].union(aaa[1]).union(aaa[2])
with the following (this needs import numpy as np at the top):
aaa = pd.DatetimeIndex(np.hstack(aaa))
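As an alternative that works for any number of rows without building and combining ranges, the rows can be repeated and then offset. This is a different technique from the one above, offered as a sketch only; the names 'expanded' and 'offset' are made up for this example.

```python
import pandas as pd

Dffit = pd.DataFrame({
    "Time": ["2022-05-28 08:52:00", "2022-05-28 09:00:00", "2022-05-28 09:09:00"],
    "fitnessActivity": ["running", "biking", "swimming"],
    "minutes": [3, 5, 4],
})
Dffit["Time"] = pd.to_datetime(Dffit["Time"])

# Repeat each row 'minutes' times; the copies keep their original index label.
expanded = Dffit.loc[Dffit.index.repeat(Dffit["minutes"])].copy()

# Number the copies 0, 1, 2, ... within each original row
# and add that many minutes to the starting time.
offset = expanded.groupby(level=0).cumcount()
expanded["Time"] = expanded["Time"] + pd.to_timedelta(offset, unit="min")

expanded = expanded.reset_index(drop=True)
print(expanded)
```

Note one difference from the output above: here 'minutes' is repeated on every row instead of being NaN on the added rows, so the column stays integer. If the NaN layout is needed, the added rows could be blanked afterwards.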