- The problem
I have 2 different functions, named update() and reset(). I am referring to these functions as 'IDs' in the title of the ticket.
They apply successively to groups of contiguous rows from a data array. The sizes of the groups to which these functions apply are defined in related arrays.
import numpy as np
# 'data' array, size 5.
data = np.array([1., 4., 3., 8., 9.], dtype='float64')
# Group sizes to which the 'update' function will apply.
update_gs = np.array([1, 0, 2, 2], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'update' function is to be performed successively on:
# - the 1st row
# - then without any row from data
# - then with the 2nd and 3rd rows from data
# - then with the 4th and 5th rows from data
# Group sizes to which the 'reset' function will apply.
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'reset' function is to be performed successively on:
# - the 1st and 2nd rows
# - a 2nd and a 3rd reset will be run without any row from data
# - a 4th reset will be run with the 3rd and 4th rows from data
# - a 5th reset will be run with the last row from data
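For illustration, the group sizes translate into row slices of data via a cumulative sum; a minimal sketch (not the solution sought, just showing which rows each successive 'update' call receives):
ends = np.cumsum(update_gs)    # index one past the last row of each group
starts = ends - update_gs      # index of the first row of each group
for start, end in zip(starts, ends):
    print(data[start:end])
# [1.]
# []
# [4. 3.]
# [8. 9.]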
The results I am looking for from this input data are 2 1D arrays:
- each row of these result arrays relates to one occurrence of either update or reset.
- consequently, these arrays are of size len(update_gs) + len(reset_gs), i.e. here 9.
- one array is of int, defining group sizes again. In this result array, a group size is the number of rows 'elapsed' since the previous occurrence of update or reset.
- the other array is of bool, defining whether a row relates to a reset occurrence (value True) or an update occurrence (value False).
- regarding the order of appearance of the update and reset occurrences:
  - row groups in data of update and reset occurrences overlap. Considering the last rows of their respective row groups, the occurrence (update or reset) whose group ends on the later row also comes later in the result arrays.
  - when the row groups of an update and a reset occurrence share the same last row, then update comes 1st in the result arrays.
With the previous data, the expected results are:
group_sizes = np.array([1, # 1st action is 'update'. It applies to the 1st row in data.
0, # There is a 2nd 'update' on 0 rows.
1, # At the 2nd row of data, there is the 1st 'reset' action.
0, # There is a 2nd 'reset' on 0 rows.
0, # There is a 3rd 'reset' on 0 rows.
1, # There is the 3rd 'update' taking place, with 1 row elapsed since the previous function.
1, # There is a 4th 'reset', with 1 row elapsed since the previous function.
1, # There is the 4th 'update' taking place, with 1 row elapsed since the previous function.
0, # Finally comes the last 'reset', with the same ending row in 'data' as the previous 'update'.
], dtype='int64')
# Sum of the values gives 5, the total number of rows in 'data'.
function_ids = np.array([False, # 1st action is 'update'.
False, # There is a 2nd 'update'.
True, # There is the 1st 'reset' action.
True, # There is a 2nd 'reset'.
True, # There is a 3rd 'reset'.
False, # There is the 3rd 'update'.
True, # There is a 4th 'reset'.
False, # There is the 4th 'update'.
True, # Finally comes the last 'reset'.
], dtype='bool')
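As a quick consistency check on these expected arrays (sums and counts line up with the inputs):
assert group_sizes.sum() == len(data)                       # all 5 rows of data are accounted for
assert len(group_sizes) == len(update_gs) + len(reset_gs)   # 9 occurrences in total
assert function_ids.sum() == len(reset_gs)                  # 5 'reset' occurrences (True)
assert (~function_ids).sum() == len(update_gs)              # 4 'update' occurrences (False)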
- Possibly an XY problem?
The above question is raised considering the following topic:
- I have as inputs the 2 arrays mentioned, reset_gs and update_gs. The way the related functions (update or reset) work depends on which function has been applied to the previous group (a reset or an update?) and on its result.
- For this reason, I 1st tried to interleave the respective calls in 2 nested for loops. This results in complex code that I have not been successful at making work yet. I believe that after some time it might be possible, with several ifs, buffer variables and boolean flags to communicate previous states between the 2 interleaved for loops. In terms of maintenance / simplicity of code, it really does not seem adequate.
- For this reason, I believe that opting for a single flattened for loop is preferable. The 2 resulting arrays I am looking for (above question) will enable me to opt for such a solution, as sketched below.
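Here is a minimal sketch of such a flattened loop, using data, group_sizes and function_ids as defined above; the update() and reset() bodies are placeholders, since their actual signatures are not shown here:
def update(rows):
    ...  # placeholder: actual 'update' logic not shown in this question

def reset(rows):
    ...  # placeholder: actual 'reset' logic not shown in this question

start = 0
for size, is_reset in zip(group_sizes, function_ids):
    rows = data[start:start + size]  # rows elapsed since the previous call
    if is_reset:
        reset(rows)
    else:
        update(rows)
    start += size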
- What about looping over data?
The size of data is several million rows.
The size of update_gs is several thousand rows.
The size of reset_gs varies from several hundred to several thousand rows.
Looking for performance, I have reasons to believe that looping over update_gs & reset_gs (i.e. group definitions, several thousand iterations) instead of data (each row individually, several million iterations) will result in faster code.
CodePudding user response:
This actually turned into 'how to perform a merge of sorted arrays?'. I discovered the sortednp package along the way, which seems to be the fast way to do it.
import numpy as np
from sortednp import merge # merge of sorted numpy arrays
# Input data.
update = np.array([1, 0, 2, 2], dtype='int64')
reset = np.array([2, 0, 0, 2, 1], dtype='int64')
# Switching from group sizes to indices of group last row, in-place.
np.cumsum(update, out=update)
np.cumsum(reset, out=reset)
# Performing a merge of sorted arrays, keeping insertion index for 'update'.
merged_idx, (update_idx, _) = merge(update, reset, indices=True)
# Going back to group sizes.
merged_gs = np.diff(merged_idx, prepend=0)
# Final step to get 'function_ids'.
function_ids = np.ones(len(merged_gs), dtype="bool")
function_ids[update_idx] = False
# Here we are.
merged_gs
Out[9]: array([1, 0, 1, 0, 0, 1, 1, 1, 0])
function_ids
Out[13]: array([False, False, True, True, True, False, True, False, True])
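For reference, the same result can also be reproduced with NumPy alone, using a stable argsort on the concatenated cumulative indices (a sketch; not necessarily as fast as sortednp on large inputs):
import numpy as np

update = np.array([1, 0, 2, 2], dtype='int64')
reset = np.array([2, 0, 0, 2, 1], dtype='int64')

# Cumulative last-row indices of each group.
update_idx = np.cumsum(update)
reset_idx = np.cumsum(reset)

# Concatenating with 'update' first: a stable sort then keeps 'update'
# before 'reset' whenever both groups end on the same row.
merged = np.concatenate([update_idx, reset_idx])
order = np.argsort(merged, kind='stable')

merged_gs = np.diff(merged[order], prepend=0)
function_ids = order >= len(update_idx)  # True for 'reset' occurrences
# merged_gs    -> array([1, 0, 1, 0, 0, 1, 1, 1, 0])
# function_ids -> array([False, False, True, True, True, False, True, False, True])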