Splitting an array according to input offset values, but leaving duplicates in the same chunk


Given a list of indexes (offset values) at which to split a numpy array, I would like to adjust them so that the split never falls on duplicate values. This means each run of duplicate values ends up in a single chunk.

I have worked out the following piece of code, which gives the expected result, but I am not super proud of it. I would like to stay in the numpy world and use vectorized numpy functions as much as possible, yet to check the indexes (offset values) I use a for loop and store the result in a list.

Do you have any idea how to vectorize the 2nd part?

If it can help: ar is a sorted array (I am not using this information in the code below).

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)])

_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup_start = unique_ind[dup_ind > 1]
dup_end = dup_start + dup_ind[dup_ind > 1]
print(f'initial offsets: {offsets}')
#print(f'dup start: {dup_start}')
#print(f'dup end: {dup_end}')

temp = []
for off in offsets:
    for ind in range(len(dup_start)):
        if off > dup_start[ind] and off < dup_end[ind]:
            off = dup_start[ind]
            break
    temp.append(off)

# Remove duplicates
offsets = list(dict.fromkeys(temp))
print(f'modified offsets: {offsets}')

Results

initial offsets: [ 0  2  4  9 11 13 15 20]
modified offsets: [0, 4, 8, 11, 13, 14, 20]
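
To double-check that the modified offsets really keep duplicates together, both sets of interior split points can be fed to np.split (a quick sanity check; np.split only takes the interior indices, so the leading 0 and the trailing len(ar) are dropped):

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])

# Initial offsets cut through the runs of 8, 15 and 22.
print(np.split(ar, [2, 4, 9, 11, 13, 15]))
# [array([8, 8]), array([ 8, 10]), array([11, 11, 13, 14, 15]), array([15, 18]), ...]

# Modified offsets keep every run of duplicates inside a single chunk.
print(np.split(ar, [4, 8, 11, 13, 14]))
# [array([ 8,  8,  8, 10]), array([11, 11, 13, 14]), array([15, 15, 18]),
#  array([19, 20]), array([21]), array([22, 22, 22, 22, 22, 22])]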

CodePudding user response:

You can use np.digitize to clamp the offsets into bins:

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,13,15,len(ar)])

_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup = np.append(unique_ind[dup_ind > 1], len(ar))

offsets = dup[np.digitize(offsets, dup) - 1]

Output:

>>> offsets
array([ 0,  0,  4,  8, 11, 11, 17])

>>> np.unique(offsets)
array([ 0,  4,  8, 11, 17])

It could very likely be compressed into much less code (probably 1 or 2 lines), but I figured you really just wanted the loop eliminated, so I thought I'd submit what I've found.
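
To make the clamping step a bit more visible, here is the same snippet with the intermediate arrays printed (dup holds the bin edges; np.digitize tells which bin each offset falls into, and indexing with the bin number minus one snaps the offset back to the left edge of its bin):

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,13,15,len(ar)])

_, unique_ind = np.unique(ar, return_index=True)
dup_ind = np.diff(unique_ind, append=len(ar))
dup = np.append(unique_ind[dup_ind > 1], len(ar))

print(dup)                                  # [ 0  4  8 11 17]
print(np.digitize(offsets, dup))            # [1 1 2 3 4 4 5]
print(dup[np.digitize(offsets, dup) - 1])   # [ 0  0  4  8 11 11 17]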

CodePudding user response:

Building upon @richardec's answer, here is a possible solution, in case it can help others.

import numpy as np

def adjust_offsets(values, offsets) -> np.ndarray:
    # Creates bins describing when duplicates start and when they end.
    _, indexes, counts = np.unique(values, return_index=True, return_counts=True)
    duplicates = counts>1
    dup_starts = indexes[duplicates]
    dup_ends = dup_starts + counts[duplicates]
    # Zipping starts and ends.
    # Bins of duplicates are interleaved with bins of unique values.
    bins = np.dstack((dup_starts, dup_ends)).flatten()
    # Offsets falling inside a duplicate bin are reset to the start of that bin.
    off_binned = np.digitize(offsets, bins, right=True)
    return np.unique(np.where((off_binned % 2) == 0, offsets, bins[off_binned - 1]))

# Test setup 1 / starting with no duplicates
# idx:         0,1,2,3,4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21
ar = np.array([4,5,8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,14,18,len(ar)-1])
off_adjusted = adjust_offsets(ar, offsets)
ref = np.array([ 0,  2,  9, 10, 14, 16])
assert np.array_equal(off_adjusted, ref)

# Test setup 2 / starting with duplicates
# idx:         0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19
ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)-1])
off_adjusted = adjust_offsets(ar, offsets)
ref = np.array([ 0,  4,  8, 11, 13, 14])
assert np.array_equal(off_adjusted, ref)

I use two tests because a previous version I came up with did not work when the array did not start with duplicates.

In two sentences, the algorithm follows this logic (a short trace of the intermediate arrays is shown after the list):

  • if an offset value (index) falls in a bin defined by [start of duplicate, end of duplicate], it is reset to 'start of duplicate';
  • if not, it is left unmodified.
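
A short trace on test setup 2 makes the even/odd trick concrete (the values below come from running the function body step by step):

import numpy as np

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)-1])

_, indexes, counts = np.unique(ar, return_index=True, return_counts=True)
duplicates = counts > 1
dup_starts = indexes[duplicates]
dup_ends = dup_starts + counts[duplicates]
bins = np.dstack((dup_starts, dup_ends)).flatten()

print(bins)                                    # [ 0  3  4  6  8 10 14 20]
print(np.digitize(offsets, bins, right=True))  # [0 1 2 5 6 6 7 7]
# An odd bin number means "inside a run of duplicates": that offset is replaced
# by the start of the run (bins[number - 1]); an even bin number leaves it as-is.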

A faster version is probably possible: as I could see from this related SO answer, np.diff is faster than np.unique, and I think one can obtain the same kind of information with np.diff. I started assessing this possibility, but it grew too complex for me.
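
For what it's worth, here is a rough, unbenchmarked sketch of how the same duplicate bins could be built from np.diff instead of np.unique, relying on the fact that ar is sorted (the helper name find_dup_bins is just for illustration):

import numpy as np

def find_dup_bins(values: np.ndarray) -> np.ndarray:
    """Return interleaved [start, end, start, end, ...] bins of duplicate runs.

    Assumes values is sorted, so equal values are always contiguous.
    """
    # Index of the first element of every run of equal values.
    run_starts = np.flatnonzero(np.diff(values, prepend=values[0] - 1))
    # Length of each run (the last run ends at len(values)).
    run_lengths = np.diff(run_starts, append=len(values))
    dup = run_lengths > 1
    return np.dstack((run_starts[dup], run_starts[dup] + run_lengths[dup])).flatten()

ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
print(find_dup_bins(ar))  # [ 0  3  4  6  8 10 14 20]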
