Interpolating a data set in pandas while ignoring missing data

I have a question about how to get interpolated data across several different "blocks" of time. In a nutshell, I have a dataset like this:

>>> import pandas as pd
>>> test_data = pd.read_csv("test_data.csv")
>>> test_data
     ID  Condition_num Condition_type  Rating  Timestamp_ms
0   101              1        Active     58.0            30
1   101              1        Active     59.0            60
2   101              1        Active     65.0            90
3   101              1        Active     70.0           120
4   101              1        Active     80.0           150
5   101              2          Break     NaN           180
6   101              3       Active_2    55.0           210
7   101              3       Active_2    60.0           240
8   101              3       Active_2    63.0           270
9   101              3       Active_2    70.0           300
10  101              4          Break     NaN           330
11  101              5       Active_3    69.0           360
12  101              5       Active_3    71.0           390
13  101              5       Active_3    50.0           420
14  101              5       Active_3    41.0           450
15  101              5       Active_3    43.0           480

I need to "resample" the final column to another time interval (e.g. 40 ms) to match it to an external data set. I have been using the following code:

# Convert the timestamp column to a datetime with the correct unit, then set it as the index
test_data['Timestamp_ms'] = pd.to_datetime(test_data['Timestamp_ms'], unit='ms')
test_data = test_data.set_index('Timestamp_ms')

# Reindex so the timeline starts at 0, resample to the highest resolution (1 ms), then resample to 40 ms
test_data = test_data.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=test_data.index.max(), freq='ms')
)

test_data = test_data.resample('1ms').interpolate().resample('40ms').interpolate()

# Round ratings to integers
test_data['Rating'] = test_data['Rating'].round()

Which gives me this:

                            ID  Condition_num Condition_type     Rating
1970-01-01 00:00:00.000    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.040  101.0       1.000000            NaN  58.333333
1970-01-01 00:00:00.080  101.0       1.000000            NaN  63.000000
1970-01-01 00:00:00.120  101.0       1.000000        Active   70.000000
1970-01-01 00:00:00.160  101.0       1.333333            NaN  75.833333
1970-01-01 00:00:00.200  101.0       2.666667            NaN  59.166667
1970-01-01 00:00:00.240  101.0       3.000000       Active_2  60.000000
1970-01-01 00:00:00.280  101.0       3.000000            NaN  65.333333
1970-01-01 00:00:00.320  101.0       3.666667            NaN  69.666667
1970-01-01 00:00:00.360  101.0       5.000000       Active_3  69.000000
1970-01-01 00:00:00.400  101.0       5.000000            NaN  64.000000
1970-01-01 00:00:00.440  101.0       5.000000            NaN  44.000000
1970-01-01 00:00:00.480  101.0       5.000000       Active_3  43.000000

The only issue is that I cannot tell which ratings fall within the "Active" conditions and which are artifacts of interpolating across the "breaks", where there are no ratings. In other words, I want the interpolation within the "Active" blocks, but with everything still aligned to the beginning of the whole data set.

I have tried entering zero ratings for the NaNs and interpolating from the top of each condition, but that only seems to make the problem worse by altering the ratings even more.
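
For reference, the zero-fill part of that attempt looked roughly like this (a sketch; the exact calls are approximations of what I ran):

# Rough sketch of the zero-fill attempt: replace the break NaNs with 0
# before running the same reindex/resample pipeline as above.
attempt = pd.read_csv("test_data.csv")
attempt['Rating'] = attempt['Rating'].fillna(0)  # zero ratings for the breaks
attempt['Timestamp_ms'] = pd.to_datetime(attempt['Timestamp_ms'], unit='ms')
attempt = attempt.set_index('Timestamp_ms')
attempt = attempt.reindex(
    pd.date_range(start=pd.to_datetime(0, unit='ms'), end=attempt.index.max(), freq='ms')
)
attempt = attempt.resample('1ms').interpolate().resample('40ms').interpolate()

The zero-filled breaks then drag the interpolated ratings on either side of each break toward 0, which is why the values drift even further from the raw data.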

Any advice would be greatly appreciated!

CodePudding user response:

I think you need to do all of your logic inside of a groupby, IIUC:

mask = df.Condition_type.ne('Break')
df2 = (df[mask].groupby('Condition_type') # Groupby Condition_type, excluding "Break" rows.
                .apply(lambda x: x.resample('1ms') # To each group... resample it.
                                  .interpolate()   # Interpolate
                                  .ffill()         # Fill values, this just applies to the Condition_type.
                                  .resample('40ms')# Resample to 40ms
                                  .asfreq())       # No need to interpolate in this direction.
                .reset_index('Condition_type', drop=True)) # We no longer need this extra index~

# Force the index to our resample'd interval, this will reveal the breaks:
df2 = df2.asfreq('40ms')
print(df2)

Output:

                            ID  Condition_num Condition_type     Rating
Timestamp_ms
1970-01-01 00:00:00.000    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.040  101.0            1.0         Active  58.333333
1970-01-01 00:00:00.080  101.0            1.0         Active  63.000000
1970-01-01 00:00:00.120  101.0            1.0         Active  70.000000
1970-01-01 00:00:00.160    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.200    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.240  101.0            3.0       Active_2  60.000000
1970-01-01 00:00:00.280  101.0            3.0       Active_2  65.333333
1970-01-01 00:00:00.320    NaN            NaN            NaN        NaN
1970-01-01 00:00:00.360  101.0            5.0       Active_3  69.000000
1970-01-01 00:00:00.400  101.0            5.0       Active_3  64.000000
1970-01-01 00:00:00.440  101.0            5.0       Active_3  44.000000
1970-01-01 00:00:00.480  101.0            5.0       Active_3  43.000000
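
If you then only want the ratings that were actually produced inside an Active block (and, say, rounded to integers to match your external data), a small post-processing step like this should work (just a sketch; the dropna/round steps are illustrative, not part of the pipeline above):

# Drop the 40 ms slots that fall in a break (their Rating is NaN) and round the rest
active_only = df2.dropna(subset=['Rating']).copy()
active_only['Rating'] = active_only['Rating'].round()

Since df2 is already on a 40 ms grid anchored at 0, either frame can be aligned to the external data set directly on the timestamp index.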