How to quickly randomly update values in an np array?

Time:11-08

So I have a large 3D array (~2000 x 1000 x 1000). I want to update each value in the array to a random integer between 1 and the current max, such that all cells equal to some value x are mapped to the same random integer. Zeros should stay unchanged. Also there can't be any repeats, i.e. two different values in the original array can't be mapped to the same random integer. The values currently fill a contiguous range between 0 and 9000, so there are quite a lot of distinct values in the array:

np.amax(arr) #output = 9000
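To illustrate on a tiny array: the goal is a bijective relabelling of the nonzero values, with zeros left alone. The mapping below is made up purely for illustration:

```python
import numpy as np

a = np.array([0, 1, 1, 2, 3, 0])
# one possible relabelling: 1 -> 3, 2 -> 1, 3 -> 2 (zeros untouched)
mapping = {0: 0, 1: 3, 2: 1, 3: 2}
out = np.vectorize(mapping.get)(a)
print(out)  # [0 3 3 1 2 0]
```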

So I tried the method below...

import random

import numpy as np
from tqdm import tqdm

max_v = np.amax(arr)
vlist = list(range(1, max_v + 1))
for l in tqdm(range(1, max_v + 1)):
    n = random.choice(vlist)          # pick an unused replacement value at random
    arr = np.where(arr == l, n, arr)  # scans the entire 2-billion-element array each pass
    vlist.remove(n)

My current code takes about 13 s per iteration, with 9000 iterations needed (for the first few iterations at least), which is far too slow. I've thought about parallelisation with concurrent.futures, but I'm sure it's likely I've missed something obvious here XD

CodePudding user response:

If your current values are in a continuous range, and you want another continuous range, you're in luck! At that point, you aren't really generating 2 billion random numbers: you're just permuting 9000 or so integers. For example:

arr = np.random.randint(9001, size=(10, 20, 20))
p = np.arange(arr.max() + 1)
np.random.shuffle(p)
arr = p[arr]

The replacement values do not have to start with zero, but if you plan on doing this iteratively, you will have to subtract off the offset before using arr as an index into p.
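One detail worth noting: a plain shuffle of the whole lookup table may move zero, but the question asks for zeros to stay unchanged. A small tweak (my own sketch, not part of the answer above) is to shuffle only `p[1:]`, which is an in-place shuffle of a view, so `p[0]` stays 0:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(9001, size=(10, 20, 20))

# identity lookup table 0..max, then shuffle only the nonzero entries
p = np.arange(arr.max() + 1)
rng.shuffle(p[1:])  # view is shuffled in place, so p[0] remains 0

out = p[arr]  # zeros map to zeros, nonzero labels are permuted
```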

CodePudding user response:

As suggested by Mad Physicist, here's my almost identical solution:

from sys import getsizeof
import numpy as np

# create a new-style random generator
rng = np.random.default_rng()

# takes ~20 seconds, ~60 secs with legacy generator
X = rng.integers(9001, size=(2000, 1000, 1000), dtype=np.uint16)

# output: 3.73 GiB, uint16 takes 1/4 space of the default int64
print(f"{getsizeof(X) / 2**30:.2f} GiB")

# generate a permutation, converting to same datatype makes slightly faster
p = rng.permutation(np.max(X) + 1).astype(X.dtype)

# iterate applying permutation, takes ~10 seconds in total
for i in range(len(X)):
    X[i] = p[X[i]]

I iterate while applying the permutation to reduce transient memory demands: it only needs one slice of the first dimension at a time (~2 MiB) rather than allocating a complete new copy of the array.
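As a quick sanity check on a small array (my own sketch), the slice-by-slice loop produces exactly the same result as indexing the whole array in one shot:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(100, size=(8, 10, 10), dtype=np.uint16)
p = rng.permutation(np.max(X) + 1).astype(X.dtype)

expected = p[X]          # one-shot fancy indexing (allocates a full copy)
for i in range(len(X)):  # in-place, one 2-D slice at a time
    X[i] = p[X[i]]

assert np.array_equal(X, expected)
```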
