Home > Software design >  Retain all columns after resample (pandas)
Retain all columns after resample (pandas)

Time:07-07

My data looks like so:

import pandas as pd
import numpy as np

BG_test_df = pd.DataFrame(
    {'PERSON_ID': [1, 1, 1],
     'TS': ['2021-08-14 19:00:27', '2021-08-14 20:00:27', '2021-08-14 22:35:27'],             
     'bias': ["Not outside of acceptable operation. Refer to patient education","Not outside of acceptable operation. Refer to patient education","Suboptimal"]}
    )

CGM_test_df = pd.DataFrame(
    {'PERSON_ID': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
     'SG': [50, 51, 52, 53, 54, 55, 400, 400, 400, 400, 400, 400, 400, 400,50, 51, 52, 53, 54, 55, 400],
     'TS': ['2021-08-14 18:30:27','2021-08-14 18:35:27','2021-08-14 18:40:27','2021-08-14 18:45:27','2021-08-14 18:50:27','2021-08-14 18:55:27', '2021-08-14 19:00:27', '2021-08-14 19:30:27','2021-08-14 19:35:27','2021-08-14 19:40:27','2021-08-14 19:45:27','2021-08-14 19:50:27','2021-08-14 19:55:27','2021-08-14 20:00:27', '2021-08-14 20:30:27','2021-08-14 20:35:27','2021-08-14 20:40:27','2021-08-14 20:45:27','2021-08-14 20:50:27','2021-08-14 20:55:27','2021-08-14 21:00:27']
     }
    )

problematic = BG_test_df.loc[BG_test_df['bias'] == "Suboptimal"]

# Convert to datetime
problematic['BG_TS'] = pd.to_datetime(problematic['TS'])
CGM_test_df['CGM_TS'] = pd.to_datetime(CGM_test_df['TS'])

merged = CGM_test_df.merge(problematic, on = "PERSON_ID")

#resample in 5 min intervals fill in empty rows with na
filled = (merged.set_index('CGM_TS').resample('5T').sum().reset_index())
filled.replace(0, np.nan, inplace=True)

When I perform a resample to set the CGM_TS column to have 5 minute intervals, I lose my other columns. In particular, I need the BG_TS column to continue the rest of my analysis. How can I retain the BG_TS column in the filled dataset?

Thanks in advance

CodePudding user response:

You can specify a different aggregation function (e.g. min) for BG_TS to keep it in the result:

merged.set_index('CGM_TS').resample('5T').agg({'PERSON_ID':np.sum, 'SG':np.sum, 'BG_TS':np.min}).reset_index()

Output (for your sample data):

                CGM_TS  PERSON_ID   SG               BG_TS
0  2021-08-14 18:30:00          1   50 2021-08-14 22:35:27
1  2021-08-14 18:35:00          1   51 2021-08-14 22:35:27
2  2021-08-14 18:40:00          1   52 2021-08-14 22:35:27
3  2021-08-14 18:45:00          1   53 2021-08-14 22:35:27
4  2021-08-14 18:50:00          1   54 2021-08-14 22:35:27
5  2021-08-14 18:55:00          1   55 2021-08-14 22:35:27
6  2021-08-14 19:00:00          1  400 2021-08-14 22:35:27
7  2021-08-14 19:05:00          0    0                 NaT
...
14 2021-08-14 19:40:00          1  400 2021-08-14 22:35:27
15 2021-08-14 19:45:00          1  400 2021-08-14 22:35:27
16 2021-08-14 19:50:00          1  400 2021-08-14 22:35:27
...
22 2021-08-14 20:20:00          0    0                 NaT
23 2021-08-14 20:25:00          0    0                 NaT
24 2021-08-14 20:30:00          1   50 2021-08-14 22:35:27
25 2021-08-14 20:35:00          1   51 2021-08-14 22:35:27
26 2021-08-14 20:40:00          1   52 2021-08-14 22:35:27
27 2021-08-14 20:45:00          1   53 2021-08-14 22:35:27
28 2021-08-14 20:50:00          1   54 2021-08-14 22:35:27
29 2021-08-14 20:55:00          1   55 2021-08-14 22:35:27
30 2021-08-14 21:00:00          1  400 2021-08-14 22:35:27
  • Related