How to compress a dataframe so that the row-to-row difference is at least some threshold in magnitude

I have a DataFrame containing approximately 7000 rows and 2 columns which looks like this:

    Time    Voltage
0    0.0  32.965541
1    0.5  32.914965
2    1.0  32.904850
3    1.5  32.864389
4   14.0  31.680907
5   24.0  31.023417
6   24.5  31.003186
7   25.0  30.982956
8   25.5  30.942495
9   26.0  30.952610
10  50.0  30.062469
11  50.5  30.022009
12  56.0  29.961317
13  56.5  29.941087
14  57.0  29.930971
15  57.5  29.910741
16  58.0  29.890511
17  73.0  21.211641
18  73.5  21.181296
19  74.0  21.201526
20  87.5  21.120604
21  88.0  21.080143
22  88.5  21.110489

I want to "compress" the dataframe so that it keeps only the time steps where the voltage differs by at least one volt in magnitude from the previously kept step.

For example, starting at time 0.0, the next voltage whose difference in magnitude is at least one volt is at time 14.0. Then, from time 14.0, the next voltage whose difference in magnitude is at least one volt is at time 50.0.

Answer:

New answer

Okay so after some time, I think I've finally come to understand what you're asking. It seems that you want to essentially "compress" the data so that each kept row differs in voltage from the previously kept row by at least 1V in magnitude.

For example, starting with the voltage at time 0.0, the next voltage whose difference is at least of magnitude 1V is the voltage at time 14.0. Then, starting from the voltage at time 14.0, the next voltage difference above the magnitude threshold is at time 50.0. Then you start looking from time 50.0, and so on.

This can be achieved using what's known as a two-pointer algorithm. You essentially track -- no surprise -- two pointers: one which is fixed at a certain index, and one that increments one step at a time from the first pointer. Then when some condition is met, the first pointer is updated to the second pointer's location, and the second pointer then starts incrementing again. Here's a basic implementation:

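def compress(x, thresh=1):
    # i marks the last kept row, j scans ahead; row 0 is always kept.
    # Note: x[i] is label-based on a Series, so this assumes the default 0..n-1 index.
    i, j, idxs = 0, 1, [0]
    while j < len(x):
        # Keep row j once it differs from row i by at least thresh.
        if abs(x[i] - x[j]) >= thresh:
            idxs.append(j)
            i = j
        j += 1
    return idxs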

When passed the Voltage column from the dataframe, this produces the following result:

In [26]: df.iloc[compress(df.Voltage, 1), :]
Out[26]:
    Time    Voltage
0    0.0  32.965541
4   14.0  31.680907
10  50.0  30.062469
17  73.0  21.211641
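
One caveat: `compress` indexes the Series with `x[i]`, which pandas resolves by label when the index holds integers, so it relies on the default 0..n-1 index. Here is a minimal sketch of a positional variant that should also work after filtering or re-indexing. The name `compress_positions` and the letter index are made up for illustration, and only the first six rows of the sample data are used:

```python
import numpy as np
import pandas as pd

def compress_positions(values, thresh=1.0):
    """Return positions whose value differs from the last kept value by >= thresh."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return []
    kept = [0]         # the first row is always kept
    anchor = arr[0]    # value at the most recently kept position
    for pos in range(1, arr.size):
        if abs(arr[pos] - anchor) >= thresh:
            kept.append(pos)
            anchor = arr[pos]
    return kept

# First six rows of the sample data, with a non-default index on purpose.
df = pd.DataFrame(
    {
        "Time": [0.0, 0.5, 1.0, 1.5, 14.0, 24.0],
        "Voltage": [32.965541, 32.914965, 32.904850, 32.864389, 31.680907, 31.023417],
    },
    index=list("abcdef"),
)

# The positions feed .iloc, so the index labels never matter.
result = df.iloc[compress_positions(df["Voltage"], thresh=1.0)]
print(result)
```

Converting to a NumPy array first also avoids per-element Series lookups inside the loop, which may help on larger frames.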

Old answer

I'll leave this old answer up so that future readers may still benefit from it.

You can get the change from the previous row with .diff():

In [7]: df["deltaVoltage"] = df["Voltage"].diff()

In [8]: df
Out[8]:
    Time    Voltage  deltaVoltage
0    0.0  32.965541           NaN
1    0.5  32.914965     -0.050576
2    1.0  32.904850     -0.010115
3    1.5  32.864389     -0.040461
4   14.0  31.680907     -1.183482
5   24.0  31.023417     -0.657490
6   24.5  31.003186     -0.020230
7   25.0  30.982956     -0.020230
8   25.5  30.942495     -0.040461
9   26.0  30.952610      0.010115
10  50.0  30.062469     -0.890140
11  50.5  30.022009     -0.040461
12  56.0  29.961317     -0.060691
13  56.5  29.941087     -0.020230
14  57.0  29.930971     -0.010115
15  57.5  29.910741     -0.020230
16  58.0  29.890511     -0.020230
17  73.0  21.211641     -8.678869
18  73.5  21.181296     -0.030346
19  74.0  21.201526      0.020230
20  87.5  21.120604     -0.080922
21  88.0  21.080143     -0.040461
22  88.5  21.110489      0.030346

Then, you can select the rows where the absolute value of the change in voltage is >= 1:

In [9]: df[df["deltaVoltage"].abs() >= 1]
Out[9]:
    Time    Voltage  deltaVoltage
4   14.0  31.680907     -1.183482
17  73.0  21.211641     -8.678869

Or, if you don't actually want the change in voltage saved as a column:

In [10]: df[df["Voltage"].diff().abs() >= 1]
Out[10]:
    Time    Voltage
4   14.0  31.680907
17  73.0  21.211641
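
Note that .diff() only compares each row with the one directly before it, so a slow drift that never moves a full volt in a single step is not picked up. A small sketch of the difference, using made-up voltages (not the data from the question) that fall by 0.4 V per row:

```python
import pandas as pd

# Hypothetical voltages: each step is 0.4 V, so no single step reaches 1 V.
s = pd.Series([10.0, 9.6, 9.2, 8.8, 8.4])

# Row-to-row comparison: nothing passes the 1 V threshold.
step_rows = s[s.diff().abs() >= 1].index.tolist()

# Comparison against the last kept row: the accumulated drift is caught.
kept = [0]
for pos in range(1, len(s)):
    if abs(s.iloc[pos] - s.iloc[kept[-1]]) >= 1:
        kept.append(pos)

print(step_rows)  # rows selected by the .diff() approach
print(kept)       # positions selected by comparing to the last kept row
```

Which of the two behaviours is wanted depends on whether "difference" means the change since the previous row or the change since the last row that was kept.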