Suppose sensors are attached to 3 climbers scaling a structure and these sensors capture a certain measurement at random times. The data are captured into the data frame below (the data frame is a lot longer than this):
df = pd.DataFrame({
'Name': ['Cody', 'Dustin', 'Dustin', 'Cody', 'Ryan', 'Dustin', 'Ryan', 'Cody'],
'Timestamp': ['08:10:23', '08:12:58', '08:15:02', '08:19:43', '08:21:00', '08:30:17', '08:34:01', '08:34:59'],
'Category': ['Body Temp', 'Altitude', 'Heart Rate', 'Body Temp', 'Heart Rate', 'Heart Rate', 'Altitude', 'Altitude'],
'Body Temp': [35.9, np.nan, np.nan, 36.2, np.nan, np.nan, np.nan, np.nan],
'Altitude': [np.nan, 7, np.nan, np.nan, np.nan, np.nan, 12, 6],
'Heart Rate': [np.nan, np.nan, 75, np.nan, 71, 69, np.nan, np.nan]
})
Name Timestamp Category Body Temp Altitude Heart Rate
0 Cody 08:10:23 Body Temp 35.9 NaN NaN
1 Dustin 08:12:58 Altitude NaN 7.0 NaN
2 Dustin 08:15:02 Heart Rate NaN NaN 75.0
3 Cody 08:19:43 Body Temp 36.2 NaN NaN
4 Ryan 08:21:00 Heart Rate NaN NaN 71.0
5 Dustin 08:30:17 Heart Rate NaN NaN 69.0
6 Ryan 08:34:01 Altitude NaN 12.0 NaN
7 Cody 08:34:59 Altitude NaN 6.0 NaN
The intention is to constantly update the measurements at every row, according to each climber and timestamp, such that every succeeding row for each climber will have their measurements updated.
So a result should look something like this:
Name Timestamp Category Body Temp Altitude Heart Rate
0 Cody 08:10:23 Body Temp 35.9 NaN NaN
1 Dustin 08:12:58 Altitude NaN 7.0 NaN
2 Dustin 08:15:02 Heart Rate NaN 7.0 75.0
3 Cody 08:19:43 Body Temp 36.2 NaN NaN
4 Ryan 08:21:00 Heart Rate NaN NaN 71.0
5 Dustin 08:30:17 Heart Rate NaN 7.0 69.0
6 Ryan 08:34:01 Altitude NaN 12.0 71.0
7 Cody 08:34:59 Altitude 36.2 6.0 NaN
So far I've thought of using .sort_value()
to separate the climbers and work from there. But I have trouble figuring out how to update each row continually. Are functions or iterrows required for this?
CodePudding user response:
The job essentially seems to be filling missing values by a previous value if such value exists at that measurement for each climber, so groupby.ffill
should do the job:
out = df[['Name']].join(df.groupby('Name').ffill())
Output:
Name Timestamp Category Body Temp Altitude Heart Rate
0 Cody 08:10:23 Body Temp 35.9 NaN NaN
1 Dustin 08:12:58 Altitude NaN 7.0 NaN
2 Dustin 08:15:02 Heart Rate NaN 7.0 75.0
3 Cody 08:19:43 Body Temp 36.2 NaN NaN
4 Ryan 08:21:00 Heart Rate NaN NaN 71.0
5 Dustin 08:30:17 Heart Rate NaN 7.0 69.0
6 Ryan 08:34:01 Altitude NaN 12.0 71.0
7 Cody 08:34:59 Altitude 36.2 6.0 NaN