How to replace values in a NumPy array using np.where in a for loop?

I have a list that contains a number of different labels and I would like to change the labels to something different.

Where, original = [1,1,1,2,2,2,3,3,3,3]

And I want to replace each distinct value with the corresponding entry of modified = [1,3,8]

so the output would be original_modified = [1,1,1,3,3,3,8,8,8,8]

So far what I have done is this for loop below:

for x, y in zip(np.unique(original), modified):
    original_modified = np.where(original == x, original, y)

However, the output is not what I intended, and I'm not sure why.

I understand I could get away with a simple for loop with if conditions; however, I am not sure that would be a very dynamic solution.

Any help is appreciated, thanks.

CodePudding user response：

Without numpy:

out = list(map(dict(zip({k:0 for k in original}.keys(), modified)).get, original))

>>> out
[1, 1, 1, 3, 3, 3, 8, 8, 8, 8]

Explanation

So why does it work?

{k:0 for k in original} is a way to find the distinct values in original, in insertion order (unlike a set, where order is undefined). It is a dict whose keys are the distinct values and whose values are all 0.
Once we have that, we take its keys() and zip them with the modified values into a dict. E.g.
```
>>> dict(zip({k:0 for k in original}.keys(), modified))
{1: 1, 2: 3, 3: 8}
```
We then use that dict as a lookup table to replace the original values, with map(_the_mapping_dict_.get, original).
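
As a sketch of the same idea written out step by step: dict.fromkeys(original) also keeps the distinct values in first-appearance order (dicts preserve insertion order in Python 3.7+), so it can stand in for the {k:0 ...} comprehension. The names `distinct` and `mapping` are only illustrative.

```python
original = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
modified = [1, 3, 8]

# dict.fromkeys keeps one key per distinct value, in first-appearance order
distinct = list(dict.fromkeys(original))

# Pair each distinct value with its replacement
mapping = dict(zip(distinct, modified))

# Look up every element of the original list in the mapping
out = [mapping[v] for v in original]
print(out)  # [1, 1, 1, 3, 3, 3, 8, 8, 8, 8]
```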

Addendum: alternatives and performance

Here are a few other ways to achieve the same result, and how long they take.

import numpy as np
import pandas as pd

def pure_py(om):
    """Pure Python"""
    original, modified = om
    return list(map(dict(zip({k: 0 for k in original}.keys(), modified)).get, original))

def py_with_pd_unique(om):
    """Using a dict for replacement, but using pd.unique() to get the unique values"""
    original, modified = om
    return list(map(dict(zip(pd.unique(original), modified)).get, original))

def np_select(om):
    """Using np.select and assuming inputs are np.array"""
    original, modified = om
    return np.select([original == v for v in pd.unique(original)], modified, original)

def vect_dict_get(om):
    """Using a vectorized dict.get()"""
    original, modified = om
    d = dict(zip(pd.unique(original), modified))
    return np.apply_along_axis(np.vectorize(d.get), 0, original)

Then:

import perfplot
from math import isqrt

def setup(n):
    original = np.random.randint(0, isqrt(n), n)
    modified = np.arange(len(pd.unique(original)))
    return original, modified

perfplot.show(
    setup=setup,
    n_range=[4 ** k for k in range(4, 11)],
    kernels=[
        pure_py,
        py_with_pd_unique,
        np_select,
        vect_dict_get,
    ],
    xlabel='len(original)',
)

Conclusion: py_with_pd_unique is the fastest through the range. For 1M elements in original, it is almost twice as fast as the rest:

o, m = setup(1_000_000)

%timeit pure_py((o, m))
# 209 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit py_with_pd_unique((o, m))
# 108 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
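
Not benchmarked here, but for NumPy inputs a fully vectorized variant is also possible. The sketch below assumes that modified lists the replacements in the sorted order of the distinct values (the order np.unique returns), which differs from the first-appearance order used by the functions above whenever the labels do not first appear in ascending order. `np_inverse` is a hypothetical name, not one of the benchmarked kernels.

```python
import numpy as np

def np_inverse(om):
    """Replace labels via np.unique(..., return_inverse=True); assumes sorted order."""
    original, modified = om
    # inverse[i] is the position of original[i] among the sorted distinct values
    _, inverse = np.unique(original, return_inverse=True)
    # Fancy indexing picks the replacement for every element at once
    return np.asarray(modified)[inverse]

original = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
modified = [1, 3, 8]
result = np_inverse((original, modified))
print(result)  # [1 1 1 3 3 3 8 8 8 8]
```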

CodePudding user response：

I found two problems in your code.

In your loop, you should read from original_modified instead of starting from original again at each iteration.
You reversed the last two arguments to np.where().

This code works:

original = np.asarray(original)  # original == x is only elementwise for an array, not a list
original_modified = original
for x, y in zip(np.unique(original), modified):
    original_modified = np.where(original == x, y, original_modified)
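
To make the two problems concrete, here is a small side-by-side run of the loop from the question and the corrected loop, assuming original is a NumPy array. The names `broken` and `fixed` are only for illustration.

```python
import numpy as np

original = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
modified = [1, 3, 8]

# Loop from the question: each pass restarts from `original`, keeps the
# matching elements unchanged and overwrites every other element with y.
# Only the result of the last pass (x=3, y=8) survives.
for x, y in zip(np.unique(original), modified):
    broken = np.where(original == x, original, y)
print(broken)  # [8 8 8 8 8 8 3 3 3 3]

# Corrected loop: write y where the label matches, and carry over the
# result accumulated so far everywhere else.
fixed = original
for x, y in zip(np.unique(original), modified):
    fixed = np.where(original == x, y, fixed)
print(fixed)  # [1 1 1 3 3 3 8 8 8 8]
```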

PS: As @PierreD pointed out, np.unique() might not be the right choice, since it sorts its results. If you need to preserve the order in which elements first appear in original, use pd.unique() instead.
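
The ordering difference only shows up when labels do not first appear in ascending order. The sketch below uses a made-up input for illustration and, as a NumPy-only substitute for pd.unique(), recovers first-appearance order from np.unique(..., return_index=True).

```python
import numpy as np

# Hypothetical input whose labels do not first appear in sorted order
original = np.array([3, 3, 1, 1, 2])

# np.unique sorts the distinct values; return_index gives, for each one,
# the position of its first occurrence in the input
sorted_vals, first_idx = np.unique(original, return_index=True)
print(sorted_vals)  # [1 2 3]

# Reordering by first occurrence restores first-appearance order
appearance_vals = sorted_vals[np.argsort(first_idx)]
print(appearance_vals)  # [3 1 2]
```

With modified = [1, 3, 8], the sorted order would map 1 to 1, 2 to 3 and 3 to 8, while the first-appearance order would map 3 to 1, 1 to 3 and 2 to 8, so it matters which convention the replacement list follows.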

CodePudding user response：

Are you really constrained to using np.where? If not, an alternative solution might be:

import numpy as np
original = np.array([1,1,1,2,2,2,3,3,3,3])
modified = original.copy()
d = {2: 3, 3: 8}
for k, v in d.items():
    modified[original == k] = v
print(modified)
# [1 1 1 3 3 3 8 8 8 8]

CodePudding user response：

Called with only a condition, np.where returns a tuple of integer index arrays (not a boolean mask) marking where the condition is satisfied.

Index assignment with that result should do it:

import numpy as np

original = np.array([1,1,1,2,2,2,3,3,3,3])
out = np.empty(original.shape, dtype=int)
modified = [1, 3, 8]

for x, y in zip(np.unique(original), modified):
    out[np.where(original == x)] = y

print(out)
# [1 1 1 3 3 3 8 8 8 8]
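
As a small illustration of what the single-argument np.where(...) call produces, the sketch below compares it with indexing by the boolean mask directly; both select the same elements for assignment. The names `mask`, `idx`, `out_a` and `out_b` are only illustrative.

```python
import numpy as np

original = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

mask = original == 2      # boolean array with the same shape as original
idx = np.where(mask)      # tuple holding one integer index array per dimension
print(idx[0])             # [3 4 5]

# Assigning through the integer indices ...
out_a = original.copy()
out_a[idx] = 3

# ... gives the same result as assigning through the boolean mask
out_b = original.copy()
out_b[mask] = 3

print(np.array_equal(out_a, out_b))  # True
```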