I have a dataframe of biometric data. I need to reduce it down to the rows that only contain a value for 'CGM'. While I know how to remove all nulls in a column I want to make sure the values from the other columns are averaged / summed before doing so.
I was trying to create an if statement but I'm relatively new to python and can't seem to find the code that works.
What I'd like the code to do would be to first check if CGM has a value and then move to HR column and get an average of that value and the previous n values (first instance this would be 4, second instance would be 15, frequency changes throughout dataframe). For steps, distance (these are populated throughout dataframe), calories, I'd like to sum the previous n values.
Any suggestions or code would be greatly appreciated. If there is something I haven't made clear please do let me know.
This would be the output I'd like to produce:
index | Time | CGM | HR | steps | distance | calories |
---|---|---|---|---|---|---|
0 | 2018-05-03 00:03:00 | 5 | 70.0 | 0.0 | 0.0 | 6.94 |
1 | 2018-05-03 00:18:00 | 4.8 | 72.533 | 0.0 | 0.0 | 17.8372 |
2 | 2018-05-03 00:33:00 | 4.8 | 72.4 | 0.0 | 0.0 | 17.8372 |
Sample Data:
index | Time | CGM | HR | steps | distance | calories |
---|---|---|---|---|---|---|
0 | 2018-05-03 00:00:00 | NaN | 70.0 | 0.0 | 0.0 | 1.1735 |
1 | 2018-05-03 00:01:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
2 | 2018-05-03 00:02:00 | NaN | 69.0 | 0.0 | 0.0 | 1.1735 |
3 | 2018-05-03 00:03:00 | 5.0 | 70.0 | 0.0 | 0.0 | 1.1735 |
4 | 2018-05-03 00:04:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
5 | 2018-05-03 00:05:00 | NaN | 72.0 | 0.0 | 0.0 | 1.29085 |
6 | 2018-05-03 00:06:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
7 | 2018-05-03 00:07:00 | NaN | 72.0 | 0.0 | 0.0 | 1.1735 |
8 | 2018-05-03 00:08:00 | NaN | 73.0 | 0.0 | 0.0 | 1.1735 |
9 | 2018-05-03 00:09:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
10 | 2018-05-03 00:10:00 | NaN | 73.0 | 0.0 | 0.0 | 1.1735 |
11 | 2018-05-03 00:11:00 | NaN | 73.0 | 0.0 | 0.0 | 1.29085 |
12 | 2018-05-03 00:12:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
13 | 2018-05-03 00:13:00 | NaN | 73.0 | 0.0 | 0.0 | 1.1735 |
14 | 2018-05-03 00:14:00 | NaN | 78.0 | 0.0 | 0.0 | 1.1735 |
15 | 2018-05-03 00:15:00 | NaN | 77.0 | 0.0 | 0.0 | 1.1735 |
16 | 2018-05-03 00:16:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
17 | 2018-05-03 00:17:00 | NaN | 72.0 | 0.0 | 0.0 | 1.1735 |
18 | 2018-05-03 00:18:00 | 4.8 | 70.0 | 0.0 | 0.0 | 1.1735 |
19 | 2018-05-03 00:19:00 | NaN | 74.0 | 0.0 | 0.0 | 1.1735 |
20 | 2018-05-03 00:20:00 | NaN | 77.0 | 0.0 | 0.0 | 1.29085 |
21 | 2018-05-03 00:21:00 | NaN | 72.0 | 0.0 | 0.0 | 1.1735 |
22 | 2018-05-03 00:22:00 | NaN | 75.0 | 0.0 | 0.0 | 1.29085 |
23 | 2018-05-03 00:23:00 | NaN | 73.0 | 0.0 | 0.0 | 1.1735 |
24 | 2018-05-03 00:24:00 | NaN | 72.0 | 0.0 | 0.0 | 1.1735 |
25 | 2018-05-03 00:25:00 | NaN | 75.0 | 0.0 | 0.0 | 1.1735 |
26 | 2018-05-03 00:26:00 | NaN | 71.0 | 0.0 | 0.0 | 1.1735 |
27 | 2018-05-03 00:27:00 | NaN | 69.0 | 0.0 | 0.0 | 1.1735 |
28 | 2018-05-03 00:28:00 | NaN | 70.0 | 0.0 | 0.0 | 1.1735 |
29 | 2018-05-03 00:29:00 | NaN | 68.0 | 0.0 | 0.0 | 1.1735 |
30 | 2018-05-03 00:30:00 | NaN | 72.0 | 0.0 | 0.0 | 1.1735 |
31 | 2018-05-03 00:31:00 | NaN | 75.0 | 0.0 | 0.0 | 1.1735 |
32 | 2018-05-03 00:32:00 | NaN | 69.0 | 0.0 | 0.0 | 1.1735 |
33 | 2018-05-03 00:33:00 | 4.8 | 74.0 | 0.0 | 0.0 | 1.1735 |
CodePudding user response:
Ok this may not be the best solution but it gives you what you want:
Starting from this df
df
CGM HR steps distance calories Time
0 NaN 70.0 0.0 0.0 1.17350 2018-05-03 00:00:00
1 NaN 71.0 0.0 0.0 1.17350 2018-05-03 00:01:00
2 NaN 69.0 0.0 0.0 1.17350 2018-05-03 00:02:00
3 5.0 70.0 0.0 0.0 1.17350 2018-05-03 00:03:00
4 NaN 71.0 0.0 0.0 1.17350 2018-05-03 00:04:00
5 NaN 72.0 0.0 0.0 1.29085 2018-05-03 00:05:00
6 7.0 70.0 0.0 0.0 1.17350 2018-05-03 00:06:00
We will divide the dataframe based on non-NaN values and then aggregate the columns. Then we will concatenate the single dataframes to have the final one:
import pandas as pd
import numpy as np
# Splitting by non-nan values
events = np.split(df, np.where(~np.isnan(df.CGM))[0])
# removing empty DataFrames
events = [ev for ev in events if not ev.empty]
# Saving first timestamp because if Nan should be readded
first_nan = events[0][:1]
events = [ev.append(ev2[:1])[1:] for ev, ev2 in zip(events[:-1], events[1:])]
if first_nan['CGM'].isna().item():
events[0] = first_nan.append(events[0])
# Preparing new dataframe
df_2 = pd.DataFrame()
# Concatenate aggregation of dataframe
for e in events:
new_df = pd.DataFrame(e.agg({"Time": "max", "CGM":"sum", "HR": "mean", "steps": "sum", "distance": "sum", "calories":"sum"})).transpose()
df_2 = pd.concat([df_2, new_df])
Output:
df_2
Time CGM HR steps distance calories
0 2018-05-03 00:03:00 5.0 70.0 0.0 0.0 4.694
0 2018-05-03 00:06:00 7.0 71.0 0.0 0.0 3.63785