How to keep a fixed size of unique values in random positions in an array while replacing others wit

Time:09-30

This may be a very simple question, as I am still learning Python. I am using numpy for this problem. Let's say I have a 2D array like this:

test_array = np.array([[0,0,0,0,0],
                      [1,1,1,1,1],
                      [0,0,0,0,0],
                      [2,2,2,4,4],
                      [4,4,4,2,2],
                      [0,0,0,0,0]])
print("existing classes:", np.unique(test_array))
# "existing classes: [0 1 2 4]"

Now I want to keep a fixed number of elements (e.g. 2) of each class that is != 0 (in this case two 1s, two 2s, and two 4s) and replace the rest with 0. Which elements get replaced should be random on each run (or determined by a seed).

For example, on run 1 I might get

([[0,0,0,0,0],
[1,0,0,1,0],
[0,0,0,0,0],
[2,0,0,0,4],
[4,0,0,2,0],
[0,0,0,0,0]])

with another run it might be

([[0,0,0,0,0],
[1,1,0,0,0],
[0,0,0,0,0],
[2,0,2,0,4],
[4,0,0,0,0],
[0,0,0,0,0]])

etc. Could anyone help me with this?

CodePudding user response:

Here is my not-so-elegant solution:

def unique(arr, num=2, seed=None):
    np.random.seed(seed)
    vals = {}
    for i, row in enumerate(arr):
        for j, val in enumerate(row):
            if val in vals and val != 0:
                vals[val].append((i, j))
            elif val != 0:
                vals[val] = [(i, j)]
    new = np.zeros_like(arr)
    for val in vals:
        np.random.shuffle(vals[val])
        while len(vals[val]) > num:
            vals[val].pop()
        for row, col in vals[val]:
            new[row,col] = val
    return new
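
Note that np.random.seed(seed) in the function above reseeds NumPy's global generator on every call. As a minimal sketch of the same idea with a local generator, assuming an illustrative name keep_n_per_class that is not part of the original answer, something like this might work:

```python
import numpy as np

def keep_n_per_class(arr, num=2, seed=None):
    """Hypothetical variant: keep at most `num` random elements per nonzero class."""
    rng = np.random.default_rng(seed)   # local RNG; global np.random state is untouched
    new = np.zeros_like(arr)
    for val in np.unique(arr):
        if val == 0:
            continue                    # 0 is the fill value, not a class to sample
        coords = np.argwhere(arr == val)  # (count, ndim) array of positions
        rng.shuffle(coords)               # shuffles the rows (positions) in place
        kept = coords[:num]               # first `num` positions after shuffling
        new[tuple(kept.T)] = val          # one index array per dimension
    return new

example = np.array([[0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1],
                    [2, 2, 2, 4, 4],
                    [4, 4, 4, 2, 2]])
result = keep_n_per_class(example, num=2, seed=0)
```

Calling it twice with the same seed should give the same array, and with seed=None the kept positions should vary from run to run.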

CodePudding user response:

My strategy is

  1. Create a new array initialized to all zeros
  2. Find the elements in each class
  3. For each class
    • Randomly sample two of its elements to keep
    • Set those elements of the new array to the class value

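The trick is to keep the indexes in the form np.nonzero returns (a tuple with one index array per dimension), so that the sampled subset can index the output array, which has the same shape as the original.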
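
To illustrate the indexing point on its own, here is a small sketch using a toy 2x3 array and a hand-picked subset (both are assumptions made for illustration, not taken from the answer):

```python
import numpy as np

arr = np.array([[0, 1, 1],
                [1, 0, 2]])

# np.nonzero returns one index array per dimension, in row-major order.
rows, cols = np.nonzero(arr == 1)   # rows -> [0 0 1], cols -> [1 2 0]

# Positions *within the class* to keep; chosen by hand here instead of randomly.
pick = np.array([0, 2])

# Apply the same selection to every index array so the tuple structure survives.
subset = (rows[pick], cols[pick])

out = np.zeros_like(arr)
out[subset] = 1                     # writes at (0, 1) and (1, 0)
print(out)
# [[0 1 0]
#  [1 0 0]]
```

Because the tuple still has one array per dimension, it indexes a 2D array directly and nothing needs to be flattened or reshaped.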

import numpy as np
test_array = np.array([[0,0,0,0,0],
                      [1,1,1,1,1],
                      [0,0,0,0,0],
                      [2,2,2,4,4],
                      [4,4,4,2,2],
                      [0,0,0,0,0]])

def sample_classes(arr, n_keep=2, random_state=42):
    # Use the function argument, not the global test_array
    classes, counts = np.unique(arr, return_counts=True)
    rng = np.random.default_rng(random_state)
    out = np.zeros_like(arr)
    for klass, count in zip(classes, counts):
        # Find locations of the class elements
        indexes = np.nonzero(arr == klass)
        # Sample up to n_keep elements of the class (fewer if the class is smaller)
        keep_idx = rng.choice(count, min(n_keep, count), replace=False)
        # Select the kept elements and reformat for indexing the output array and retaining its shape
        keep_idx_reshape = tuple(ind[keep_idx] for ind in indexes)
        out[keep_idx_reshape] = klass
    return out

You can use it like

In [3]: sample_classes(test_array)
Out[3]:
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 0, 4, 0],
       [4, 0, 0, 2, 0],
       [0, 0, 0, 0, 0]])

In [4]: sample_classes(test_array, n_keep=3)
Out[4]:
array([[0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 4, 0],
       [4, 4, 0, 2, 2],
       [0, 0, 0, 0, 0]])

In [5]: sample_classes(test_array, random_state=88)
Out[5]:
array([[0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [4, 0, 4, 2, 2],
       [0, 0, 0, 0, 0]])

In [6]: sample_classes(test_array, random_state=88, n_keep=4)
Out[6]:
array([[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [2, 2, 0, 4, 4],
       [4, 4, 0, 2, 2],
       [0, 0, 0, 0, 0]])
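
Outputs like these can be checked by counting how many elements of each nonzero class survive. A minimal sketch, assuming a helper name nonzero_class_counts that is not part of the answer, applied to the Out[3] array shown above:

```python
import numpy as np

def nonzero_class_counts(arr):
    """Return {class value: number of occurrences} for all nonzero values (hypothetical helper)."""
    values, counts = np.unique(arr[arr != 0], return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}

# The array printed as Out[3] above
out3 = np.array([[0, 0, 0, 0, 0],
                 [0, 1, 1, 0, 0],
                 [0, 0, 0, 0, 0],
                 [2, 0, 0, 4, 0],
                 [4, 0, 0, 2, 0],
                 [0, 0, 0, 0, 0]])

print(nonzero_class_counts(out3))  # {1: 2, 2: 2, 4: 2}
```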

CodePudding user response:

The following should be O(n log n) in the array size:

def keep_k_per_class(data,k,rng):
    out = np.zeros_like(data)
    unq,cnts = np.unique(data,return_counts=True)
    assert (cnts >= k).all()
    # calculate class boundaries from class sizes
    CNTS = cnts.cumsum()
    # indirectly group classes together by partial sorting
    idx = data.ravel().argpartition(CNTS[:-1])
    # the following lines implement simultaneous drawing without replacement
    # from all classes

    # lower boundaries of intervals to draw random numbers from
    # for each class they start with the lower class boundary 
    # and from there grow one by one - together with the
    # swapping out below this implements "without replacement"
    lb = np.add.outer(np.arange(k),CNTS-cnts)
    pick = rng.integers(lb,CNTS,lb.shape)
    for l,p in zip(lb,pick):
        # populate output array
        out.ravel()[idx[p]] = unq
        # swap out used indices so still available ones occupy a linear
        # range (per class)
        idx[p] = idx[l]
    return out
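
The "swap out" step is easier to follow for a single class. The sketch below is an illustrative, non-vectorized version of that idea only (the function name and the plain Python loop are assumptions, not the answer's code): draw from a shrinking range [low, n) and move the unused item at position low into the slot that was just used.

```python
import numpy as np

def draw_k_without_replacement(pool, k, rng):
    """Draw k distinct items from pool via partial swapping (hypothetical helper)."""
    pool = np.array(pool)            # work on a copy
    n = len(pool)
    picked = []
    for low in range(k):
        # Random slot in the still-available range [low, n)
        p = rng.integers(low, n)
        picked.append(int(pool[p]))
        # The item at `low` has not been used yet: move it into the freed slot,
        # so that all remaining items occupy the range [low + 1, n).
        pool[p] = pool[low]
    return picked

rng = np.random.default_rng(0)
picks = draw_k_without_replacement(np.arange(10), 4, rng)
```

In keep_k_per_class the same step is carried out for all classes at once: each row of lb holds the current lower bound of every class, and each row of pick holds one draw per class.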

Examples:

>>> rng = np.random.default_rng()
>>> keep_k_per_class(test_array,2,rng)
array([[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 2, 0, 4],
       [0, 4, 0, 0, 0],
       [0, 0, 0, 0, 0]])
>>> keep_k_per_class(test_array,2,rng)
array([[0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0],
       [4, 0, 4, 0, 2],
       [0, 0, 0, 0, 0]])

And a larger example:

>>> BIG = np.add.outer(np.tile(test_array,(100,100)),np.arange(0,500,5))
>>> BIG.size
30000000
>>> res = keep_k_per_class(BIG,30,rng)
### takes ~4 sec

### check
>>> np.unique(np.bincount(res.ravel()),return_counts=True)
(array([       0,       30, 29988030]), array([100, 399,   1]))