Home > Net >  Creating a Python function that will calculate an average value or sum values based on n previous va
Creating a Python function that will calculate an average value or sum values based on n previous va

Time:07-17

I have a dataframe of biometric data. I need to reduce it down to the rows that only contain a value for 'CGM'. While I know how to remove all nulls in a column I want to make sure the values from the other columns are averaged / summed before doing so.

I was trying to create an if statement but I'm relatively new to python and can't seem to find the code that works.

What I'd like the code to do would be to first check if CGM has a value and then move to HR column and get an average of that value and the previous n values (first instance this would be 4, second instance would be 15, frequency changes throughout dataframe). For steps, distance (these are populated throughout dataframe), calories, I'd like to sum the previous n values.

Any suggestions or code would be greatly appreciated. If there is something I haven't made clear please do let me know.

This would be the output I'd like to produce:

index Time CGM HR steps distance calories
0 2018-05-03 00:03:00 5 70.0 0.0 0.0 6.94
1 2018-05-03 00:18:00 4.8 72.533 0.0 0.0 17.8372
2 2018-05-03 00:33:00 4.8 72.4 0.0 0.0 17.8372

Sample Data:

index Time CGM HR steps distance calories
0 2018-05-03 00:00:00 NaN 70.0 0.0 0.0 1.1735
1 2018-05-03 00:01:00 NaN 71.0 0.0 0.0 1.1735
2 2018-05-03 00:02:00 NaN 69.0 0.0 0.0 1.1735
3 2018-05-03 00:03:00 5.0 70.0 0.0 0.0 1.1735
4 2018-05-03 00:04:00 NaN 71.0 0.0 0.0 1.1735
5 2018-05-03 00:05:00 NaN 72.0 0.0 0.0 1.29085
6 2018-05-03 00:06:00 NaN 71.0 0.0 0.0 1.1735
7 2018-05-03 00:07:00 NaN 72.0 0.0 0.0 1.1735
8 2018-05-03 00:08:00 NaN 73.0 0.0 0.0 1.1735
9 2018-05-03 00:09:00 NaN 71.0 0.0 0.0 1.1735
10 2018-05-03 00:10:00 NaN 73.0 0.0 0.0 1.1735
11 2018-05-03 00:11:00 NaN 73.0 0.0 0.0 1.29085
12 2018-05-03 00:12:00 NaN 71.0 0.0 0.0 1.1735
13 2018-05-03 00:13:00 NaN 73.0 0.0 0.0 1.1735
14 2018-05-03 00:14:00 NaN 78.0 0.0 0.0 1.1735
15 2018-05-03 00:15:00 NaN 77.0 0.0 0.0 1.1735
16 2018-05-03 00:16:00 NaN 71.0 0.0 0.0 1.1735
17 2018-05-03 00:17:00 NaN 72.0 0.0 0.0 1.1735
18 2018-05-03 00:18:00 4.8 70.0 0.0 0.0 1.1735
19 2018-05-03 00:19:00 NaN 74.0 0.0 0.0 1.1735
20 2018-05-03 00:20:00 NaN 77.0 0.0 0.0 1.29085
21 2018-05-03 00:21:00 NaN 72.0 0.0 0.0 1.1735
22 2018-05-03 00:22:00 NaN 75.0 0.0 0.0 1.29085
23 2018-05-03 00:23:00 NaN 73.0 0.0 0.0 1.1735
24 2018-05-03 00:24:00 NaN 72.0 0.0 0.0 1.1735
25 2018-05-03 00:25:00 NaN 75.0 0.0 0.0 1.1735
26 2018-05-03 00:26:00 NaN 71.0 0.0 0.0 1.1735
27 2018-05-03 00:27:00 NaN 69.0 0.0 0.0 1.1735
28 2018-05-03 00:28:00 NaN 70.0 0.0 0.0 1.1735
29 2018-05-03 00:29:00 NaN 68.0 0.0 0.0 1.1735
30 2018-05-03 00:30:00 NaN 72.0 0.0 0.0 1.1735
31 2018-05-03 00:31:00 NaN 75.0 0.0 0.0 1.1735
32 2018-05-03 00:32:00 NaN 69.0 0.0 0.0 1.1735
33 2018-05-03 00:33:00 4.8 74.0 0.0 0.0 1.1735

CodePudding user response:

Ok this may not be the best solution but it gives you what you want:

Starting from this df

df

    CGM HR      steps   distance    calories    Time
0   NaN 70.0    0.0     0.0         1.17350     2018-05-03 00:00:00
1   NaN 71.0    0.0     0.0         1.17350     2018-05-03 00:01:00
2   NaN 69.0    0.0     0.0         1.17350     2018-05-03 00:02:00
3   5.0 70.0    0.0     0.0         1.17350     2018-05-03 00:03:00
4   NaN 71.0    0.0     0.0         1.17350     2018-05-03 00:04:00
5   NaN 72.0    0.0     0.0         1.29085     2018-05-03 00:05:00
6   7.0 70.0    0.0     0.0         1.17350     2018-05-03 00:06:00

We will divide the dataframe based on non-NaN values and then aggregate the columns. Then we will concatenate the single dataframes to have the final one:

import pandas as pd
import numpy as np

# Splitting by non-nan values
events = np.split(df, np.where(~np.isnan(df.CGM))[0])

# removing empty DataFrames
events = [ev for ev in events if not ev.empty]

# Saving first timestamp because if Nan should be readded
first_nan = events[0][:1]
events = [ev.append(ev2[:1])[1:] for ev, ev2 in zip(events[:-1], events[1:])]

if first_nan['CGM'].isna().item():
    events[0] = first_nan.append(events[0])

# Preparing new dataframe
df_2 = pd.DataFrame()

# Concatenate aggregation of dataframe
for e in events:
  new_df = pd.DataFrame(e.agg({"Time": "max", "CGM":"sum", "HR": "mean", "steps": "sum", "distance": "sum", "calories":"sum"})).transpose()
  df_2 = pd.concat([df_2, new_df])

Output:

df_2

Time                    CGM   HR    steps   distance    calories

0   2018-05-03 00:03:00 5.0   70.0  0.0     0.0         4.694
0   2018-05-03 00:06:00 7.0   71.0  0.0     0.0         3.63785
  • Related