I have a script that produces bins using np.digitize of values like so:
| time | butter | bin |
| :---- | :------ | :--- |
| 2022-07-04 17:33:45 00:00 | 1.041967 | 3 |
| 2022-07-04 17:34:00 00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:15 00:00 | 1.041966 | 4 |
| 2022-07-04 17:34:30 00:00 | 1.041967 | 4 |
| 2022-07-04 17:34:45 00:00 | 1.041968 | 4 |
| 2022-07-04 17:35:00 00:00 | 1.041969 | 4 |
| 2022-07-04 17:35:15 00:00 | 1.041971 | 4 |
| 2022-07-04 17:35:30 00:00 | 1.041973 | 4 |
| 2022-07-04 17:35:45 00:00 | 1.041975 | 4 |
| 2022-07-04 17:36:00 00:00 | 1.041977 | 5 |
| 2022-07-04 17:36:15 00:00 | 1.041979 | 5 |
| 2022-07-04 17:36:30 00:00 | 1.041981 | 5 |
| 2022-07-04 17:36:45 00:00 | 1.041983 | 5 |
| 2022-07-04 17:37:00 00:00 | 1.041985 | 5 |
| 2022-07-04 17:37:15 00:00 | 1.041986 | 6 |
| 2022-07-04 17:37:30 00:00 | 1.041987 | 6 |
| 2022-07-04 17:37:45 00:00 | 1.041988 | 6 |
| 2022-07-04 17:38:00 00:00 | 1.041989 | 6 |
bins can increase/decrease and they can skip bins.
How can I calculate the local min/max of each bin, so that the result looks like this:
| time | butter | bin | min | max |
| :---- | :------ | :--- | :--- | :--- |
| 2022-07-04 17:33:45 00:00 | 1.041967 | 3 | 1.041967 | 1.041967 |
| 2022-07-04 17:34:00 00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:15 00:00 | 1.041966 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:30 00:00 | 1.041967 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:34:45 00:00 | 1.041968 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:00 00:00 | 1.041969 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:15 00:00 | 1.041971 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:30 00:00 | 1.041973 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:35:45 00:00 | 1.041975 | 4 | 1.041966 | 1.041975 |
| 2022-07-04 17:36:00 00:00 | 1.041977 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:15 00:00 | 1.041979 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:30 00:00 | 1.041981 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:36:45 00:00 | 1.041983 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:00 00:00 | 1.041985 | 5 | 1.041977 | 1.041985 |
| 2022-07-04 17:37:15 00:00 | 1.041986 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:30 00:00 | 1.041987 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:37:45 00:00 | 1.041988 | 6 | 1.041986 | 1.041989 |
| 2022-07-04 17:38:00 00:00 | 1.041989 | 6 | 1.041986 | 1.041989 |
I was trying something with np.where and np.diff with np.sign but I haven't figured it out nor do I feel that's the best way to go about the question.
thank you in advance.
CodePudding user response:
Here are two ways to do what your question asks, one using pandas and the other using numpy (UPDATED to reflect OP's clarification in comments regarding binning to be on the basis of contiguously grouped bin
values):
res = df.assign(bin_index = (df.bin != df.bin.shift()).cumsum())
dfAggs = res[['butter', 'bin', 'bin_index']].groupby(['bin', 'bin_index']).agg([min, max])
dfAggs.columns = dfAggs.columns.droplevel()
res = res.join(dfAggs, on=['bin', 'bin_index']).drop(columns='bin_index')
print("", "pandas:", res, sep="\n")
a = df.copy().to_numpy()
print("", "input as numpy 2d array", a, sep="\n")
bin_index = a[:,2:3] != np.concatenate((np.full((1, 1), np.NaN), a[:-1,2:3]), axis = 0)
bin_index = np.cumsum(bin_index)
bins = np.unique(bin_index)
aggs = np.empty((a.shape[0], 2))
for b in bins:
mask = bin_index==b
aggs[mask, :] = (a[mask, 1].min(), a[mask, 1].max())
res = np.concatenate((a, aggs), axis=1)
print("", "numpy:", res, sep="\n")
Output:
input as pandas dataframe
time butter bin
0 2022-07-04 17:33:45 00:00 1.041967 3
1 2022-07-04 17:34:00 00:00 1.041967 4
2 2022-07-04 17:34:15 00:00 1.041966 4
3 2022-07-04 17:34:30 00:00 1.041967 4
4 2022-07-04 17:34:45 00:00 1.041968 4
5 2022-07-04 17:35:00 00:00 1.041969 4
6 2022-07-04 17:35:15 00:00 1.041971 4
7 2022-07-04 17:35:30 00:00 1.041973 4
8 2022-07-04 17:35:45 00:00 1.041975 4
9 2022-07-04 17:36:00 00:00 1.041977 5
10 2022-07-04 17:36:15 00:00 1.041979 5
11 2022-07-04 17:36:30 00:00 1.041981 5
12 2022-07-04 17:36:45 00:00 1.041983 5
13 2022-07-04 17:37:00 00:00 1.041985 5
14 2022-07-04 17:37:15 00:00 1.041986 6
15 2022-07-04 17:37:30 00:00 1.041987 6
16 2022-07-04 17:37:45 00:00 1.041988 6
17 2022-07-04 17:38:00 00:00 1.041989 6
18 2022-07-04 17:38:15 00:00 1.041990 4
19 2022-07-04 17:38:30 00:00 1.041995 4
pandas:
time butter bin min max
0 2022-07-04 17:33:45 00:00 1.041967 3 1.041967 1.041967
1 2022-07-04 17:34:00 00:00 1.041967 4 1.041966 1.041975
2 2022-07-04 17:34:15 00:00 1.041966 4 1.041966 1.041975
3 2022-07-04 17:34:30 00:00 1.041967 4 1.041966 1.041975
4 2022-07-04 17:34:45 00:00 1.041968 4 1.041966 1.041975
5 2022-07-04 17:35:00 00:00 1.041969 4 1.041966 1.041975
6 2022-07-04 17:35:15 00:00 1.041971 4 1.041966 1.041975
7 2022-07-04 17:35:30 00:00 1.041973 4 1.041966 1.041975
8 2022-07-04 17:35:45 00:00 1.041975 4 1.041966 1.041975
9 2022-07-04 17:36:00 00:00 1.041977 5 1.041977 1.041985
10 2022-07-04 17:36:15 00:00 1.041979 5 1.041977 1.041985
11 2022-07-04 17:36:30 00:00 1.041981 5 1.041977 1.041985
12 2022-07-04 17:36:45 00:00 1.041983 5 1.041977 1.041985
13 2022-07-04 17:37:00 00:00 1.041985 5 1.041977 1.041985
14 2022-07-04 17:37:15 00:00 1.041986 6 1.041986 1.041989
15 2022-07-04 17:37:30 00:00 1.041987 6 1.041986 1.041989
16 2022-07-04 17:37:45 00:00 1.041988 6 1.041986 1.041989
17 2022-07-04 17:38:00 00:00 1.041989 6 1.041986 1.041989
18 2022-07-04 17:38:15 00:00 1.041990 4 1.041990 1.041995
19 2022-07-04 17:38:30 00:00 1.041995 4 1.041990 1.041995
input as numpy 2d array
[['2022-07-04 17:33:45 00:00' 1.041967 3]
['2022-07-04 17:34:00 00:00' 1.041967 4]
['2022-07-04 17:34:15 00:00' 1.041966 4]
['2022-07-04 17:34:30 00:00' 1.041967 4]
['2022-07-04 17:34:45 00:00' 1.041968 4]
['2022-07-04 17:35:00 00:00' 1.041969 4]
['2022-07-04 17:35:15 00:00' 1.041971 4]
['2022-07-04 17:35:30 00:00' 1.041973 4]
['2022-07-04 17:35:45 00:00' 1.041975 4]
['2022-07-04 17:36:00 00:00' 1.041977 5]
['2022-07-04 17:36:15 00:00' 1.041979 5]
['2022-07-04 17:36:30 00:00' 1.041981 5]
['2022-07-04 17:36:45 00:00' 1.041983 5]
['2022-07-04 17:37:00 00:00' 1.041985 5]
['2022-07-04 17:37:15 00:00' 1.041986 6]
['2022-07-04 17:37:30 00:00' 1.041987 6]
['2022-07-04 17:37:45 00:00' 1.041988 6]
['2022-07-04 17:38:00 00:00' 1.041989 6]
['2022-07-04 17:38:15 00:00' 1.04199 4]
['2022-07-04 17:38:30 00:00' 1.041995 4]]
numpy:
[['2022-07-04 17:33:45 00:00' 1.041967 3 1.041967 1.041967]
['2022-07-04 17:34:00 00:00' 1.041967 4 1.041966 1.041975]
['2022-07-04 17:34:15 00:00' 1.041966 4 1.041966 1.041975]
['2022-07-04 17:34:30 00:00' 1.041967 4 1.041966 1.041975]
['2022-07-04 17:34:45 00:00' 1.041968 4 1.041966 1.041975]
['2022-07-04 17:35:00 00:00' 1.041969 4 1.041966 1.041975]
['2022-07-04 17:35:15 00:00' 1.041971 4 1.041966 1.041975]
['2022-07-04 17:35:30 00:00' 1.041973 4 1.041966 1.041975]
['2022-07-04 17:35:45 00:00' 1.041975 4 1.041966 1.041975]
['2022-07-04 17:36:00 00:00' 1.041977 5 1.041977 1.041985]
['2022-07-04 17:36:15 00:00' 1.041979 5 1.041977 1.041985]
['2022-07-04 17:36:30 00:00' 1.041981 5 1.041977 1.041985]
['2022-07-04 17:36:45 00:00' 1.041983 5 1.041977 1.041985]
['2022-07-04 17:37:00 00:00' 1.041985 5 1.041977 1.041985]
['2022-07-04 17:37:15 00:00' 1.041986 6 1.041986 1.041989]
['2022-07-04 17:37:30 00:00' 1.041987 6 1.041986 1.041989]
['2022-07-04 17:37:45 00:00' 1.041988 6 1.041986 1.041989]
['2022-07-04 17:38:00 00:00' 1.041989 6 1.041986 1.041989]
['2022-07-04 17:38:15 00:00' 1.04199 4 1.04199 1.041995]
['2022-07-04 17:38:30 00:00' 1.041995 4 1.04199 1.041995]]
Pandas explanation:
- Create
bin_index
column which detects changes inbin
and increments an id value for each such row - Use
DataFrame.groupby()
to perform the aggregation (min
,max
) based onbin_index
- Use
DataFrame.join()
(with preprocessing of the aggregation dataframeaggs
to remove the level of its MultiIndex namedbutter
) to addmin
andmax
columns to the original dataframe.
Numpy explanation:
- Create
bin_index
array which detects changes inbin
and increments an id value for each such row - Prepare
aggs
as an array with shapea.shape[0], 2
for receivingmin
andmax
columns for the correspondingbin
value in the input arraya
- Use a boolean mask for each unique
bin
value inbin_index
to perform aggregation on the corresponding rows of thebutter
column ofa
and to place these two values in the columns ofaggs
for these same rows - Use
numpy.concatenate()
to gluea
andaggs
together horizontally.