Home > Software engineering >  Interleaving IDs of groups, with IDs coming from 2 separate arrays, defining size of groups
Interleaving IDs of groups, with IDs coming from 2 separate arrays, defining size of groups

Time:12-07

  1. The problem

I have 2 different functions, named update() and reset(). I am referring to these functions as 'IDs' in the title of the ticket. They apply successively to groups of contiguous rows from a data array.

The sizes of groups to which apply these functions are defined in related arrays.

import numpy as np

# 'data' array, size 5.
data = np.array([1., 4., 3., 8., 9.], dtype='float64')

# Group sizes to which will apply 'update' function.
update_gs = np.array([1, 0, 2, 2], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'update' function is to be performed successively on:
# - the 1st row
# - then without any row from data
# - then with the 2nd and 3rd rows from data
# - then with the 4th and 5th rows from data

# Group sizes to which will apply 'reset' function.
reset_gs = np.array([2, 0, 0, 2, 1], dtype='int64')
# Sum of group sizes is 5, the number of rows in `data`.
# This array means that the 'reset' function is to be performed successively on:
# - the 2 1st rows
# - a 2nd and 3rd reset will be run without any row from data
# - a 4th reset will be run with 3rd and 4th rows from data
# - a 5th reset will be run with the last row from data

The result I am looking for from this input data are 2 1D arrays:

  • each row of these result arrays relates to one occurrence of either update or reset.
  • consequently, these arrays are of size len(update_gs) len(reset_gs), i.e. here 8
  • one array is of int, defining group sizes again. In this result array, group sizes are defined as the number of the 'elapsed' rows since the previous occurrence of update or reset.
  • the other array is of bool, defining if a row relates to a reset function (value True) or an update function (value False)
  • regarding the order of appearance of the update and reset occurrences:
    • row groups in data of update and reset occurrences overlap. Considering the last rows in their respective row groups, the row coming the latter between each row group makes also the corresponding occurrence (update and reset) the latter.
    • When row groups of an update and a reset occurrence share the same last row, then update comes 1st in result array.

With previous data, expected results are:

group_sizes = np.array([1, # 1st action is 'update'. It applies to 1st row in data.
                        0, # There is a 2nd 'update' on 0 row.
                        1, # At 2nd row of data, there is the 1st 'reset' action.
                        0, # There is a 2nd 'reset' on 0 row.
                        0, # There is a 3rd 'reset' on 0 row.
                        1, # There is the 3rd 'update' taking place, with 1 row elapsed since previous function.
                        1, # There is a 4th 'reset', with 1 row elapsed since previous function.
                        1, # There is the 4th 'update' taking place, with 1 row elapsed since previous function.
                        0, # Occurs finally the last 'reset', with same ending row in 'data' than the previous 'update'
                       ], dtype='int64')
# Sum of the values gives 5, the total number of rows in 'data'.                        

function_ids = np.array([False, # 1st action is 'update'.
                         False, # There is a 2nd 'update'.
                         True,  # There is the 1st 'reset' action.
                         True,  # There is a 2nd 'reset'.
                         True,  # There is a 3rd 'reset'.
                         False, # There is the 3rd 'update'.
                         True,  # There is a 4th 'reset'.
                         False, # There is the 3rd 'update'.
                         True,  # Occurs finally the last 'reset'.
                        ], dtype='bool')
  1. Possibly a XY problem?

Above question is raised considering following topic:

  • I have as inputs the 2 arrays mentioned, reset_gs and update_gs. The way the related functions (update or reset) work depends what function has been applied on the previous group (a reset or an update?) and its result.
  • For this reason, I tried 1st to interleave the respective calls in 2 for loops, imbricated. This results in a complex code, that I have not been successful at making work yet. I believe after some time, it might be possible, with several ifs, buffer variables and boolean flags to communicate about previous states between the 2 interleaved for loops. In terms of maintenance / simplicity of code, it really does not seem adequate.
  • For this reason, I believe that opting to a flatten for loop is preferable. The 2 resulting arrays I am looking for (above question) will enable me to opt for such a solution.
  1. What about looping over data?

Size of data is several millions of rows. Size of update_gs is several thousands of rows. Size of reset_gs varies from several hundreds to several thousands of rows.

Looking for performance, I have reasons to believe that looping over update_gs & reset_gs (i.e. group definitions - several thousand iterations) instead of data (each row individually - several millions iterations) will result in faster code.

CodePudding user response:

This actually turned into 'how making a mergesort?'. I discovered about sortednp package in the way, that seems the fast way to do it.

import numpy as np
from sortednp import merge # merge of sorted numpy arrays

# Input data.
update = np.array([1, 0, 2, 2], dtype='int64')
reset = np.array([2, 0, 0, 2, 1], dtype='int64')

# Switching from group sizes to indices of group last row, in-place.
np.cumsum(update, out=update)
np.cumsum(reset, out=reset)
# Performing a merge of sorted arrays, keeping insertion index for 'update'.
merged_idx, (update_idx, _) = merge(update, reset, indices=True)
# Going back to group sizes.
merged_gs = np.diff(merged_idx, prepend=0)
# Final step to get 'function_ids'.
function_ids = np.ones(len(merged_gs), dtype="bool")
function_ids[update_idx] = False

# Here we are.
merged_gs
Out[9]: array([1, 0, 1, 0, 0, 1, 1, 1, 0])
function_ids
Out[13]: array([False, False,  True,  True,  True, False,  True, False,  True])
  • Related