Given a list of indexes (offset values) according to which a numpy array is to be split, I would like to adjust it so that the splitting does not occur on duplicate values. This means duplicate values will end up in one chunk only.
I have worked out the following piece of code, which gives the result, but I am not super proud of it. I would like to stay in the numpy world and use vectorized numpy functions as much as possible.
But to check the indexes (offset values) I use a for loop, and store the result in a list.
Do you have any idea how to vectorize the 2nd part?
If this can help, ar is an ordered array (I am not using this info in the code below).
import numpy as np
import vaex as vx
ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)])
_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup_start = unique_ind[dup_ind > 1]
dup_end = dup_start + dup_ind[dup_ind > 1]
print(f'initial offsets: {offsets}')
#print(f'dup start: {dup_start}')
#print(f'dup end: {dup_end}')
temp = []
for off in offsets:
    for ind in range(len(dup_start)):
        if off > dup_start[ind] and off < dup_end[ind]:
            off = dup_start[ind]
            break
    temp.append(off)
# Remove duplicates
offsets = list(dict.fromkeys(temp))
print(f'modified offsets: {offsets}')
Results
initial offsets: [ 0 2 4 9 11 13 15 20]
modified offsets: [0, 4, 8, 11, 13, 14, 20]
CodePudding user response:
You can use np.digitize to clamp the offsets into bins:
ar = np.array([8,8,8,10,11,11,13,14,15,15,18,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,13,15,len(ar)])
_, unique_ind = np.unique(ar, return_index=True, return_inverse=False)
dup_ind = np.diff(unique_ind, append=len(ar))
dup = np.append(unique_ind[dup_ind > 1], len(ar))
offsets = dup[np.digitize(offsets, dup) - 1]
Output:
>>> offsets
array([ 0, 0, 4, 8, 11, 11, 17])
>>> np.unique(offsets)
array([ 0, 4, 8, 11, 17])
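For reference, np.digitize returns, for each offset, the index of the bin it falls into, so subtracting 1 gives the duplicate-run start (or array end) at or just before that offset. With the ar and offsets above, the intermediate values (taken before the final assignment) look like this:
>>> dup
array([ 0,  4,  8, 11, 17])
>>> np.digitize(offsets, dup)
array([1, 1, 2, 3, 4, 4, 5])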
It can very likely be compressed into much less code (1 or 2 lines probably), but I figured that you really just want the loop eliminated, so I thought I'd submit what I've found.
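For instance, a condensed sketch of the same idea (assuming numpy is imported as np, and using return_counts instead of the diff of the unique indexes) could look like:
_, unique_ind, counts = np.unique(ar, return_index=True, return_counts=True)
dup = np.append(unique_ind[counts > 1], len(ar))
offsets = np.unique(dup[np.digitize(offsets, dup) - 1])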
CodePudding user response:
Building upon @richardec's answer, here is a possible solution, in case it can help others.
import numpy as np
def adjust_offsets(values, offsets) -> np.ndarray:
    # Create bins describing where duplicates start and where they end.
    _, indexes, counts = np.unique(values, return_index=True, return_counts=True)
    duplicates = counts > 1
    dup_starts = indexes[duplicates]
    dup_ends = dup_starts + counts[duplicates]
    # Zip starts and ends.
    # Bins of duplicates are interleaved with bins of unique values.
    bins = np.dstack((dup_starts, dup_ends)).flatten()
    # Set start of bin for offsets falling in a bin.
    off_binned = np.digitize(offsets, bins, right=True)
    return np.unique(np.where((off_binned % 2) == 0, offsets, bins[off_binned - 1]))
# Test setup 1 / starting with no duplicates
# idx: 0,1,2,3,4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21
ar = np.array([4,5,8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,14,18,len(ar)-1])
off_adjusted = adjust_offsets(ar, offsets)
ref = np.array([ 0, 2, 9, 10, 14, 16])
assert np.array_equal(off_adjusted, ref)
# Test setup 2 / starting with duplicates
# idx: 0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19
ar = np.array([8,8,8,10,11,11,13,14,15,15,18,19,20,21,22,22,22,22,22,22])
offsets = np.array([0,2,4,9,11,13,15,len(ar)-1])
off_adjusted = adjust_offsets(ar, offsets)
ref = np.array([ 0, 4, 8, 11, 13, 14])
assert np.array_equal(off_adjusted, ref)
I use 2 tests because a previous version I came up with would not work when the array did not start with duplicates.
In 2 sentences, the algorithm follows this logic (see the small sketch after the list):
- if an offset value (index) falls in a bin defined by [start of duplicate, end of duplicate], then it is reset to 'start of duplicate';
- if not, it is left unmodified.
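A small illustration of the parity trick on the bins from Test setup 2 (values worked out by hand, just for clarity): an odd bin index from np.digitize means the offset falls inside a duplicate run, an even one means it falls outside.
bins = np.array([0, 3, 4, 6, 8, 10, 14, 20])  # interleaved [dup_start, dup_end) pairs
np.digitize(np.array([2, 9, 13]), bins, right=True)  # -> array([1, 5, 6]): 2 and 9 are inside runs, 13 is not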
There is possibly a faster way: I could see from this related SO answer that np.diff is faster than np.unique, and I think one can obtain the same kind of information with np.diff. I started assessing this possibility, but it grew too complex for me.
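As a rough sketch of that direction (the names are illustrative, and it relies on ar being sorted), the run boundaries could be derived with np.diff like this:
run_starts = np.flatnonzero(np.diff(ar, prepend=ar[0] - 1))  # first index of every run of equal values
run_lengths = np.diff(run_starts, append=len(ar))
dup_starts = run_starts[run_lengths > 1]
dup_ends = dup_starts + run_lengths[run_lengths > 1]
# With the ar from Test setup 2 this gives dup_starts = [0, 4, 8, 14] and dup_ends = [3, 6, 10, 20].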