I am attempting to find duplicate rows in a numpy array. The following code replicates the structure of my array which has n rows, m columns, and nz non-zero entries per row:
import numpy as np
import random
import datetime
def create_mat(n, m, nz):
    sample_mat = np.zeros((n, m), dtype='uint8')
    random.seed(42)
    for row in range(0, n):
        counter = 0
        while counter < nz:
            random_col = random.randrange(0, m-1, 1)
            if sample_mat[row, random_col] == 0:
                sample_mat[row, random_col] = 1
                counter += 1
    test = np.all(np.sum(sample_mat, axis=1) == nz)
    print(f'All rows have {nz} elements: {test}')
    return sample_mat
The code I am attempting to optimize is as follows:
if __name__ == '__main__':
    threshold = 2
    mat = create_mat(1800000, 108, 8)
    print(f'Time: {datetime.datetime.now()}')
    duplicate_rows, _, duplicate_counts = np.unique(mat, axis=0, return_counts=True, return_index=True)
    duplicate_indices = [int(x) for x in np.argwhere(duplicate_counts >= threshold)]
    print(f'Time: {datetime.datetime.now()}')
    print(f'Duplicate rows: {len(duplicate_rows)} Sample inds: {duplicate_indices[0:5]} Sample counts: {duplicate_counts[0:5]}')
    print(f'Sample rows:')
    print(duplicate_rows[0:5])
My output is as follows:
All rows have 8 elements: True
Time: 2022-06-29 12:08:07.320834
Time: 2022-06-29 12:08:23.281633
Duplicate rows: 1799994 Sample inds: [508991, 553136, 930379, 1128637, 1290356] Sample counts: [1 1 1 1 1]
Sample rows:
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0]]
I have considered using numba, but the challenge is that np.unique under Numba does not support the axis parameter. Similarly, conversion to a list and use of sets is an option, but then looping through to perform the duplicate counts seems "unpythonic".
Given that I need to run this code multiple times (since I am modifying the numpy array and then needing to re-search for duplicates), the time is critical. I have also tried to use multiprocessing for this step, but np.unique appears to be blocking (i.e. even when I try to run multiple instances of this, I end up constrained to one thread running at 6% CPU capacity while the other threads sit idle).
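For reference, the set/dict-style counting I was considering looks roughly like the sketch below; it is exactly the kind of Python-level looping I would prefer to avoid, and the helper name is just illustrative:
    from collections import Counter

    def count_duplicates_python(mat, threshold=2):
        # Turn each uint8 row into its raw bytes so it can be used as a dict key,
        # then keep only the keys seen at least `threshold` times.
        counts = Counter(row.tobytes() for row in mat)
        return {key: c for key, c in counts.items() if c >= threshold}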
CodePudding user response:
Step 1: bit packing
Since your matrix only contains binary values, you can aggressively pack the bits into uint64 values so as to perform a much more efficient sort afterwards. Here is a Numba implementation:
import numpy as np
import numba as nb
@nb.njit('(uint8[:,::1],)', parallel=True)
def pack_bits(mat):
    n, m = mat.shape
    res = np.zeros((n, (m + 63) // 64), np.uint64)
    for i in nb.prange(n):
        for bj in range(0, m, 64):
            val = np.uint64(0)
            if bj + 64 <= m:
                # Fast case
                for j in range(64):
                    val += np.uint64(mat[i, bj + j]) << (63 - j)
            else:
                # Slow case (boundary)
                for j in range(m - bj):
                    val += np.uint64(mat[i, bj + j]) << (63 - j)
            res[i, bj // 64] = val
    return res
@nb.njit('(uint64[:,::1], int_)', parallel=True)
def unpack_bits(mat, m):
    n = mat.shape[0]
    assert mat.shape[1] == (m + 63) // 64
    res = np.zeros((n, m), np.uint64)
    for i in nb.prange(n):
        for bj in range(0, m, 64):
            val = np.uint64(mat[i, bj // 64])
            if bj + 64 <= m:
                # Fast case
                for j in range(64):
                    res[i, bj + j] = np.uint8((val >> (63 - j)) & 1)
            else:
                # Slow case (boundary)
                for j in range(m - bj):
                    res[i, bj + j] = np.uint8((val >> (63 - j)) & 1)
    return res
The np.unique function can then be called on the much smaller packed array, as in the initial code (except that the resulting sorted array is a packed one and needs to be unpacked). Since you do not need the indices, it is better not to compute them, so return_index=True can be removed. Additionally, only the required values need to be unpacked (unpacking is a bit more expensive than packing, because writing a big matrix is more expensive than reading an existing one).
if __name__ == '__main__':
    threshold = 2
    n, m = 1800000, 108
    mat = create_mat(n, m, 8)
    print(f'Time: {datetime.datetime.now()}')
    packed_mat = pack_bits(mat)
    duplicate_packed_rows, duplicate_counts = np.unique(packed_mat, axis=0, return_counts=True)
    duplicate_indices = [int(x) for x in np.argwhere(duplicate_counts >= threshold)]
    print(f'Time: {datetime.datetime.now()}')
    print(f'Duplicate rows: {len(duplicate_packed_rows)} Sample inds: {duplicate_indices[0:5]} Sample counts: {duplicate_counts[0:5]}')
    print(f'Sample rows:')
    print(unpack_bits(duplicate_packed_rows[0:5], m))
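As an aside (not part of the benchmark below), if you prefer to avoid the Numba dependency for this step, NumPy's own np.packbits can do a similar row-wise packing, into uint8 words rather than uint64. A minimal sketch, with a hypothetical helper name:
    def pack_bits_numpy(mat):
        # Pack each row of 0/1 values into uint8 words along axis 1.
        # The last word of each row is zero-padded, which is harmless for
        # uniqueness checks since every row is padded the same way.
        return np.packbits(mat, axis=1)

    # packed = pack_bits_numpy(mat)  # shape (n, ceil(m / 8)), dtype uint8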
Step 2: np.unique optimizations
The np.unique call is sub-optimal, as it performs multiple expensive internal sorting steps. Not all of them are needed in your specific case, and some steps can be optimized.
A more efficient implementation consists in sorting the last column during a first step, then sorting the previous column, and so on until the first column is sorted, similar to what a radix sort does. Note that the last column can be sorted using a non-stable algorithm (generally faster) but the others need a stable one. This method is still sub-optimal, as argsort calls are slow and the current implementation does not use multiple threads yet. Unfortunately, Numpy does not provide any efficient way to sort the rows of a 2D array yet. While it is possible to reimplement this in Numba, doing so is cumbersome, a bit tricky and bug-prone, not to mention that Numba introduces some overhead compared to native C/C++ code. Once sorted, the unique/duplicate rows can be tracked and counted. Here is an implementation:
def sort_lines(mat):
    n, m = mat.shape
    for i in range(m):
        kind = 'stable' if i > 0 else None
        mat = mat[np.argsort(mat[:, m - 1 - i], kind=kind)]
    return mat
@nb.njit('(uint64[:,::1],)', parallel=True)
def find_duplicates(sorted_mat):
    n, m = sorted_mat.shape
    assert m >= 0
    isUnique = np.zeros(n, np.bool_)
    uniqueCount = 1
    if n > 0:
        isUnique[0] = True
    for i in nb.prange(1, n):
        isUniqueVal = False
        for j in range(m):
            isUniqueVal |= sorted_mat[i, j] != sorted_mat[i-1, j]
        isUnique[i] = isUniqueVal
        uniqueCount += isUniqueVal
    uniqueValues = np.empty((uniqueCount, m), np.uint64)
    duplicateCounts = np.zeros(len(uniqueValues), np.uint64)
    cursor = 0
    for i in range(n):
        cursor += isUnique[i]
        for j in range(m):
            uniqueValues[cursor-1, j] = sorted_mat[i, j]
        duplicateCounts[cursor-1] += 1
    return uniqueValues, duplicateCounts
The previous np.unique call can be replaced by find_duplicates(sort_lines(packed_mat)).
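For clarity, here is a sketch of how the pieces of steps 1 and 2 fit together in the main script (the variable names simply mirror the earlier snippets and can of course be changed):
    if __name__ == '__main__':
        threshold = 2
        n, m = 1800000, 108
        mat = create_mat(n, m, 8)
        print(f'Time: {datetime.datetime.now()}')
        packed_mat = pack_bits(mat)                            # step 1: bit packing
        sorted_packed = sort_lines(packed_mat)                 # step 2: radix-like row sort
        unique_packed, duplicate_counts = find_duplicates(sorted_packed)
        duplicate_indices = np.flatnonzero(duplicate_counts >= threshold)
        print(f'Time: {datetime.datetime.now()}')
        print(f'Unique rows: {len(unique_packed)}  Rows occurring >= {threshold} times: {len(duplicate_indices)}')
        print(unpack_bits(unique_packed[duplicate_indices[:5]], m))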
Step 3: GPU-based np.unique
While implementing a fast algorithm to sort rows is not easy on the CPU with Numba and Numpy, one can simply use CuPy to do it on the GPU, assuming an Nvidia GPU is available and CUDA is installed (as well as CuPy). This solution has the benefit of being simple and significantly more efficient. Here is an example:
import cupy as cp

def cupy_sort_lines(mat):
    cupy_mat = cp.array(mat)
    return cupy_mat[cp.lexsort(cupy_mat.T[::-1, :])].get()
The previous sort_lines call can be replaced by cupy_sort_lines.
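In other words, the step-2 sketch above only changes in one line (only the row sort moves to the GPU; the duplicate scan stays in Numba on the CPU):
    unique_packed, duplicate_counts = find_duplicates(cupy_sort_lines(packed_mat))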
Results
Here are the timings on my machine with a 6-core i5-9600KF CPU and an Nvidia 1660 Super GPU:
Initial version: 15.541 s
Optimized packing: 0.982 s
Optimized np.unique: 0.634 s
GPU-based sorting: 0.143 s (requires an Nvidia GPU)
Thus, the CPU-based optimized version is about 25 times faster and the GPU-based one is about 109 times faster. Note that the sort takes a significant amount of time in all versions. Also, please note that the unpacking is not included in the benchmark (as seen in the provided code). It takes a negligible time as long as only a few rows are unpacked rather than the full array (which takes roughly ~200 ms on my machine). This last operation can be further optimized at the expense of a significantly more complex implementation.
CodePudding user response:
If you don't want to use bit packing or call np.unique on a multidimensional array, you could also create a view. It's slower than the other optimized approaches but rather educational:
def create_view(arr):
    # Collapse each row into a single np.void element so that whole rows
    # can be compared as opaque byte strings.
    arr = np.ascontiguousarray(arr)
    void_dt = np.dtype((np.void, arr.dtype.itemsize * arr.shape[1]))
    return arr.view(void_dt).ravel()
def create_mat(n, m, nz):
    rng = np.random.default_rng()
    data = rng.permuted(np.full((n, m), [1] * nz + [0] * (m - nz), dtype=np.uint8), axis=1)
    return data
if __name__ == '__main__':
    mat = create_mat(1800000, 108, 8)
    mat_view = create_view(mat)
    u, idx, counts = np.unique(mat_view, axis=0, return_index=True, return_counts=True)
    duplicate_indices = np.flatnonzero(counts >= 2)
    print(f'Duplicate inds: {duplicate_indices}')
    print(f'Duplicate counts: {counts[duplicate_indices]}')
    print(f'Unique rows: {len(u)}')
Note that I have fixed some code as well; create_mat could be done in a more efficient way. Runtime is ~3 seconds :)
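If you also need to map the result back to the original rows (not shown above), a possible follow-up sketch using the same mat and mat_view, with return_inverse added; the variable names are just illustrative:
    u, idx, inv, counts = np.unique(mat_view, return_index=True,
                                    return_inverse=True, return_counts=True)
    duplicate_groups = np.flatnonzero(counts >= 2)
    representatives = mat[idx[duplicate_groups]]                      # one original row per duplicated group
    all_occurrences = np.flatnonzero(np.isin(inv, duplicate_groups))  # row indices of every duplicate occurrence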
CodePudding user response:
I would suggest scipy.stats.rankdata, with your favorite axis as axis. Notice that with method set to 'min', the unique values in the resulting array give you the indices of the unique rows.