I have a question about how to get interpolated data across several different "blocks" of time. In a nutshell, I have a dataset like this:
>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
ID Condition_num Condition_type Rating Timestamp_ms
0 101 1 Active 58.0 30
1 101 1 Active 59.0 60
2 101 1 Active 65.0 90
3 101 1 Active 70.0 120
4 101 1 Active 80.0 150
5 101 2 Break NaN 180
6 101 3 Active_2 55.0 210
7 101 3 Active_2 60.0 240
8 101 3 Active_2 63.0 270
9 101 3 Active_2 70.0 300
10 101 4 Break NaN 330
11 101 5 Active_3 69.0 360
12 101 5 Active_3 71.0 390
13 101 5 Active_3 50.0 420
14 101 5 Active_3 41.0 450
15 101 5 Active_3 43.0 480
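For anyone who wants to reproduce this without the CSV file, the same frame can be built directly from the values shown above:
import pandas as pd
test_data = pd.DataFrame({
    "ID": [101] * 16,
    "Condition_num": [1]*5 + [2] + [3]*4 + [4] + [5]*5,
    "Condition_type": (["Active"]*5 + ["Break"] + ["Active_2"]*4
                       + ["Break"] + ["Active_3"]*5),
    "Rating": [58.0, 59.0, 65.0, 70.0, 80.0, None,
               55.0, 60.0, 63.0, 70.0, None,
               69.0, 71.0, 50.0, 41.0, 43.0],
    "Timestamp_ms": range(30, 510, 30),  # 30 ms steps from 30 to 480
})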
I need to "resample" the final column to another time interval (e.g. 40 ms) to match it to an external dataset. I have been using the following code:
#Setting the column with timestamps as a datetime with the correct units, then set index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')
#Reindex to start at 0, upsample to the highest resolution (1ms), then resample to 40ms
test_data = test_data.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)
test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()
#Round timestamps to whole milliseconds
test_data.index = test_data.index.round('ms')
Which gives me this:
ID Condition_num Condition_type Rating
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.000000 NaN 58.333333
1970-01-01 00:00:00.080 101.0 1.000000 NaN 63.000000
1970-01-01 00:00:00.120 101.0 1.000000 Active 70.000000
1970-01-01 00:00:00.160 101.0 1.333333 NaN 75.833333
1970-01-01 00:00:00.200 101.0 2.666667 NaN 59.166667
1970-01-01 00:00:00.240 101.0 3.000000 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.000000 NaN 65.333333
1970-01-01 00:00:00.320 101.0 3.666667 NaN 69.666667
1970-01-01 00:00:00.360 101.0 5.000000 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.000000 NaN 64.000000
1970-01-01 00:00:00.440 101.0 5.000000 NaN 44.000000
1970-01-01 00:00:00.480 101.0 5.000000 Active_3 43.000000
The only issue is that I cannot tell which ratings fall within the "Active" conditions, and whether the ratings I am seeing are artifacts of interpolating across the "breaks", where there are no ratings. In other words, I want interpolation within the "Active" blocks, but with everything still aligned to the beginning of the whole dataset.
I have tried filling the NaNs with zero ratings and interpolating from the top of each condition, but that only seems to make the problem worse by distorting the ratings further.
Any advice would be greatly appreciated!
CodePudding user response:
I think you need to do all of your logic inside of a groupby, IIUC:
mask = df.Condition_type.ne('Break')
df2 = (df[mask].groupby('Condition_type')        # Group by Condition_type, excluding "Break" rows.
               .apply(lambda x: x.resample('1ms')   # To each group... resample it.
                                 .interpolate()     # Interpolate the numeric columns.
                                 .ffill()           # Forward-fill; this applies to Condition_type.
                                 .resample('40ms')  # Downsample to 40ms.
                                 .asfreq())         # No need to interpolate in this direction.
               .reset_index('Condition_type', drop=True))  # We no longer need this extra index~
# Force the index onto our resampled 40ms grid, this will reveal the breaks:
df2 = df2.asfreq('40ms')
print(df2)
Output:
ID Condition_num Condition_type Rating
Timestamp_ms
1970-01-01 00:00:00.000 NaN NaN NaN NaN
1970-01-01 00:00:00.040 101.0 1.0 Active 58.333333
1970-01-01 00:00:00.080 101.0 1.0 Active 63.000000
1970-01-01 00:00:00.120 101.0 1.0 Active 70.000000
1970-01-01 00:00:00.160 NaN NaN NaN NaN
1970-01-01 00:00:00.200 NaN NaN NaN NaN
1970-01-01 00:00:00.240 101.0 3.0 Active_2 60.000000
1970-01-01 00:00:00.280 101.0 3.0 Active_2 65.333333
1970-01-01 00:00:00.320 NaN NaN NaN NaN
1970-01-01 00:00:00.360 101.0 5.0 Active_3 69.000000
1970-01-01 00:00:00.400 101.0 5.0 Active_3 64.000000
1970-01-01 00:00:00.440 101.0 5.0 Active_3 44.000000
1970-01-01 00:00:00.480 101.0 5.0 Active_3 43.000000
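From here, matching the external dataset should just be an index alignment, since df2 is already on a 40 ms grid starting at zero. A minimal sketch, assuming a hypothetical external frame sampled on the same clock (the "Signal" column and its values are invented for illustration):
import pandas as pd
# Hypothetical external signal sampled every 40 ms from t=0.
external = pd.DataFrame(
    {"Signal": range(13)},
    index=pd.date_range(start=pd.to_datetime(0, unit="ms"), periods=13, freq="40ms"),
)
# df2 and external share the same 40 ms grid, so a plain index join aligns them.
combined = df2.join(external)
# Rows that fall inside a break keep NaN ratings, so they are easy to drop or flag:
print(combined.dropna(subset=["Rating"]))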