Home > database >  Python local min/max while bin has not changed
Python local min/max while bin has not changed

Time:07-05

I have a script that produces bins using np.digitize of values like so:

| time | butter | bin |
| :---- | :------ | :--- |
| 2022-07-04 17:33:45 00:00 | 1.041967 | 3 |
| 2022-07-04 17:34:00 00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:15 00:00 | 1.041966 | 4 |
| 2022-07-04 17:34:30 00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:45 00:00 | 1.041968 | 4 |
| 2022-07-04 17:35:00 00:00 | 1.041969 | 4 |
| 2022-07-04 17:35:15 00:00 | 1.041971 | 4 |
| 2022-07-04 17:35:30 00:00 | 1.041973 | 4 |
| 2022-07-04 17:35:45 00:00 | 1.041975 | 4 |
| 2022-07-04 17:36:00 00:00 | 1.041977 | 5 |
| 2022-07-04 17:36:15 00:00 | 1.041979 | 5 |
| 2022-07-04 17:36:30 00:00 | 1.041981 | 5 |
| 2022-07-04 17:36:45 00:00 | 1.041983 | 5 |
| 2022-07-04 17:37:00 00:00 | 1.041985 | 5 |
| 2022-07-04 17:37:15 00:00 | 1.041986 | 6 |
| 2022-07-04 17:37:30 00:00 | 1.041987 | 6 |
| 2022-07-04 17:37:45 00:00 | 1.041988 | 6 |
| 2022-07-04 17:38:00 00:00 | 1.041989 | 6 |

bins can increase/decrease and they can skip bins.

How can I calculate the local min/max of each bin, so that the result looks like this:

| time | butter | bin | min | max |
| :---- | :------ | :--- | :--- | :--- |
| 2022-07-04 17:33:45 00:00 | 1.041967 | 3 | 1.041967 | 1.041967 |
| 2022-07-04 17:34:00 00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:15 00:00 | 1.041966 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:30 00:00 | 1.041967 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:45 00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:00 00:00 | 1.041969 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:15 00:00 | 1.041971 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:30 00:00 | 1.041973 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:45 00:00 | 1.041975 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:36:00 00:00 | 1.041977 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:15 00:00 | 1.041979 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:30 00:00 | 1.041981 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:45 00:00 | 1.041983 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:00 00:00 | 1.041985 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:15 00:00 | 1.041986 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:30 00:00 | 1.041987 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:45 00:00 | 1.041988 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:38:00 00:00 | 1.041989 | 6 | 1.041986 | 1.041989 |

I was trying something with np.where and np.diff with np.sign but I haven't figured it out nor do I feel that's the best way to go about the question.

thank you in advance.

CodePudding user response:

Here are two ways to do what your question asks, one using pandas and the other using numpy (UPDATED to reflect OP's clarification in comments regarding binning to be on the basis of contiguously grouped bin values):

res = df.assign(bin_index = (df.bin != df.bin.shift()).cumsum())

dfAggs = res[['butter', 'bin', 'bin_index']].groupby(['bin', 'bin_index']).agg([min, max])
dfAggs.columns = dfAggs.columns.droplevel()
res = res.join(dfAggs, on=['bin', 'bin_index']).drop(columns='bin_index')
print("", "pandas:", res, sep="\n")

a = df.copy().to_numpy()
print("", "input as numpy 2d array", a, sep="\n")
bin_index = a[:,2:3] != np.concatenate((np.full((1, 1), np.NaN), a[:-1,2:3]), axis = 0)
bin_index = np.cumsum(bin_index)
bins = np.unique(bin_index)
aggs = np.empty((a.shape[0], 2))
for b in bins:
    mask = bin_index==b
    aggs[mask, :] = (a[mask, 1].min(), a[mask, 1].max())
res = np.concatenate((a, aggs), axis=1)
print("", "numpy:", res, sep="\n")

Output:

input as pandas dataframe
                         time    butter  bin
0   2022-07-04 17:33:45 00:00  1.041967    3
1   2022-07-04 17:34:00 00:00  1.041967    4
2   2022-07-04 17:34:15 00:00  1.041966    4
3   2022-07-04 17:34:30 00:00  1.041967    4
4   2022-07-04 17:34:45 00:00  1.041968    4
5   2022-07-04 17:35:00 00:00  1.041969    4
6   2022-07-04 17:35:15 00:00  1.041971    4
7   2022-07-04 17:35:30 00:00  1.041973    4
8   2022-07-04 17:35:45 00:00  1.041975    4
9   2022-07-04 17:36:00 00:00  1.041977    5
10  2022-07-04 17:36:15 00:00  1.041979    5
11  2022-07-04 17:36:30 00:00  1.041981    5
12  2022-07-04 17:36:45 00:00  1.041983    5
13  2022-07-04 17:37:00 00:00  1.041985    5
14  2022-07-04 17:37:15 00:00  1.041986    6
15  2022-07-04 17:37:30 00:00  1.041987    6
16  2022-07-04 17:37:45 00:00  1.041988    6
17  2022-07-04 17:38:00 00:00  1.041989    6
18  2022-07-04 17:38:15 00:00  1.041990    4
19  2022-07-04 17:38:30 00:00  1.041995    4

pandas:
                         time    butter  bin       min       max
0   2022-07-04 17:33:45 00:00  1.041967    3  1.041967  1.041967
1   2022-07-04 17:34:00 00:00  1.041967    4  1.041966  1.041975
2   2022-07-04 17:34:15 00:00  1.041966    4  1.041966  1.041975
3   2022-07-04 17:34:30 00:00  1.041967    4  1.041966  1.041975
4   2022-07-04 17:34:45 00:00  1.041968    4  1.041966  1.041975
5   2022-07-04 17:35:00 00:00  1.041969    4  1.041966  1.041975
6   2022-07-04 17:35:15 00:00  1.041971    4  1.041966  1.041975
7   2022-07-04 17:35:30 00:00  1.041973    4  1.041966  1.041975
8   2022-07-04 17:35:45 00:00  1.041975    4  1.041966  1.041975
9   2022-07-04 17:36:00 00:00  1.041977    5  1.041977  1.041985
10  2022-07-04 17:36:15 00:00  1.041979    5  1.041977  1.041985
11  2022-07-04 17:36:30 00:00  1.041981    5  1.041977  1.041985
12  2022-07-04 17:36:45 00:00  1.041983    5  1.041977  1.041985
13  2022-07-04 17:37:00 00:00  1.041985    5  1.041977  1.041985
14  2022-07-04 17:37:15 00:00  1.041986    6  1.041986  1.041989
15  2022-07-04 17:37:30 00:00  1.041987    6  1.041986  1.041989
16  2022-07-04 17:37:45 00:00  1.041988    6  1.041986  1.041989
17  2022-07-04 17:38:00 00:00  1.041989    6  1.041986  1.041989
18  2022-07-04 17:38:15 00:00  1.041990    4  1.041990  1.041995
19  2022-07-04 17:38:30 00:00  1.041995    4  1.041990  1.041995

input as numpy 2d array
[['2022-07-04 17:33:45 00:00' 1.041967 3]
 ['2022-07-04 17:34:00 00:00' 1.041967 4]
 ['2022-07-04 17:34:15 00:00' 1.041966 4]
 ['2022-07-04 17:34:30 00:00' 1.041967 4]
 ['2022-07-04 17:34:45 00:00' 1.041968 4]
 ['2022-07-04 17:35:00 00:00' 1.041969 4]
 ['2022-07-04 17:35:15 00:00' 1.041971 4]
 ['2022-07-04 17:35:30 00:00' 1.041973 4]
 ['2022-07-04 17:35:45 00:00' 1.041975 4]
 ['2022-07-04 17:36:00 00:00' 1.041977 5]
 ['2022-07-04 17:36:15 00:00' 1.041979 5]
 ['2022-07-04 17:36:30 00:00' 1.041981 5]
 ['2022-07-04 17:36:45 00:00' 1.041983 5]
 ['2022-07-04 17:37:00 00:00' 1.041985 5]
 ['2022-07-04 17:37:15 00:00' 1.041986 6]
 ['2022-07-04 17:37:30 00:00' 1.041987 6]
 ['2022-07-04 17:37:45 00:00' 1.041988 6]
 ['2022-07-04 17:38:00 00:00' 1.041989 6]
 ['2022-07-04 17:38:15 00:00' 1.04199 4]
 ['2022-07-04 17:38:30 00:00' 1.041995 4]]

numpy:
[['2022-07-04 17:33:45 00:00' 1.041967 3 1.041967 1.041967]
 ['2022-07-04 17:34:00 00:00' 1.041967 4 1.041966 1.041975]
 ['2022-07-04 17:34:15 00:00' 1.041966 4 1.041966 1.041975]
 ['2022-07-04 17:34:30 00:00' 1.041967 4 1.041966 1.041975]
 ['2022-07-04 17:34:45 00:00' 1.041968 4 1.041966 1.041975]
 ['2022-07-04 17:35:00 00:00' 1.041969 4 1.041966 1.041975]
 ['2022-07-04 17:35:15 00:00' 1.041971 4 1.041966 1.041975]
 ['2022-07-04 17:35:30 00:00' 1.041973 4 1.041966 1.041975]
 ['2022-07-04 17:35:45 00:00' 1.041975 4 1.041966 1.041975]
 ['2022-07-04 17:36:00 00:00' 1.041977 5 1.041977 1.041985]
 ['2022-07-04 17:36:15 00:00' 1.041979 5 1.041977 1.041985]
 ['2022-07-04 17:36:30 00:00' 1.041981 5 1.041977 1.041985]
 ['2022-07-04 17:36:45 00:00' 1.041983 5 1.041977 1.041985]
 ['2022-07-04 17:37:00 00:00' 1.041985 5 1.041977 1.041985]
 ['2022-07-04 17:37:15 00:00' 1.041986 6 1.041986 1.041989]
 ['2022-07-04 17:37:30 00:00' 1.041987 6 1.041986 1.041989]
 ['2022-07-04 17:37:45 00:00' 1.041988 6 1.041986 1.041989]
 ['2022-07-04 17:38:00 00:00' 1.041989 6 1.041986 1.041989]
 ['2022-07-04 17:38:15 00:00' 1.04199 4 1.04199 1.041995]
 ['2022-07-04 17:38:30 00:00' 1.041995 4 1.04199 1.041995]]

Pandas explanation:

  • Create bin_index column which detects changes in bin and increments an id value for each such row
  • Use DataFrame.groupby() to perform the aggregation (min, max) based on bin_index
  • Use DataFrame.join() (with preprocessing of the aggregation dataframe aggs to remove the level of its MultiIndex named butter) to add min and max columns to the original dataframe.

Numpy explanation:

  • Create bin_index array which detects changes in bin and increments an id value for each such row
  • Prepare aggs as an array with shape a.shape[0], 2 for receiving min and max columns for the corresponding bin value in the input array a
  • Use a boolean mask for each unique bin value in bin_index to perform aggregation on the corresponding rows of the butter column of a and to place these two values in the columns of aggs for these same rows
  • Use numpy.concatenate() to glue a and aggs together horizontally.
  • Related