I have a list that contains a number of different labels, and I would like to replace each label with a different value.
For example, original = [1,1,1,2,2,2,3,3,3,3]
and I want to replace each distinct value using modified = [1,3,8]
so that the output is original_modified = [1,1,1,3,3,3,8,8,8,8]
So far, I have written the for loop below:
for x, y in zip(np.unique(original), modified):
    original_modified = np.where(original == x, original, y)
However, the output is incorrect, and I'm not sure why.
I understand I could get away with a simple for loop with if conditions; however, I am not sure that would be a very dynamic solution.
Any help is appreciated, thanks.
CodePudding user response:
Without numpy:
out = list(map(dict(zip({k:0 for k in original}.keys(), modified)).get, original))
>>> out
[1, 1, 1, 3, 3, 3, 8, 8, 8, 8]
Explanation
So why does it work?
- {k:0 for k in original} is a way to find the distinct values in original, in insertion order (unlike set, where order is undefined). It is a dict where the keys are the distinct values, and the value is always 0.
- once we have that, we take the keys() and zip them with the modified values into a dict. E.g.
>>> dict(zip({k:0 for k in original}.keys(), modified))
{1: 1, 2: 3, 3: 8}
- we then use that as a map to replace the original values with map(_the_mapping_dict_.get, original).
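The steps above can also be written out one at a time. This is only a sketch of the same idea; the intermediate names `distinct` and `mapping` are illustrative and do not appear in the one-liner.

```python
original = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
modified = [1, 3, 8]

# Step 1: distinct values, in order of first appearance (dict keys keep insertion order).
distinct = list({k: 0 for k in original})

# Step 2: pair each distinct value with its replacement.
mapping = dict(zip(distinct, modified))

# Step 3: look up every element of the original list in the mapping.
out = [mapping[v] for v in original]

print(mapping)  # {1: 1, 2: 3, 3: 8}
print(out)      # [1, 1, 1, 3, 3, 3, 8, 8, 8, 8]
```

Indexing with `mapping[v]` raises a KeyError for a value that has no replacement, whereas `mapping.get` would silently return None; which behaviour is preferable depends on the data.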
Addendum: alternatives and performance
Here are a few other ways to achieve the same result, and how long they take.
import numpy as np
import pandas as pd

def pure_py(om):
    """Pure Python"""
    original, modified = om
    return list(map(dict(zip({k: 0 for k in original}.keys(), modified)).get, original))

def py_with_pd_unique(om):
    """Using a dict for replacement, but using pd.unique() to get the unique values"""
    original, modified = om
    return list(map(dict(zip(pd.unique(original), modified)).get, original))

def np_select(om):
    """Using np.select and assuming inputs are np.array"""
    original, modified = om
    return np.select([original == v for v in pd.unique(original)], modified, original)

def vect_dict_get(om):
    """Using a vectorized dict.get()"""
    original, modified = om
    d = dict(zip(pd.unique(original), modified))
    return np.apply_along_axis(np.vectorize(d.get), 0, original)
Then:
import perfplot
from math import isqrt
def setup(n):
    original = np.random.randint(0, isqrt(n), n)
    modified = np.arange(len(pd.unique(original)))
    return original, modified

perfplot.show(
    setup=setup,
    n_range=[4 ** k for k in range(4, 11)],
    kernels=[
        pure_py,
        py_with_pd_unique,
        np_select,
        vect_dict_get,
    ],
    xlabel='len(original)',
)
Conclusion: py_with_pd_unique is the fastest through the range. For 1M elements in original, it is almost twice as fast as the rest:
o, m = setup(1_000_000)
%timeit pure_py((o, m))
# 209 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit py_with_pd_unique((o, m))
# 108 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
CodePudding user response:
I found two problems in your code.
- In your loop, you should read from original_modified instead of starting from original again at each iteration.
- You reversed the last two arguments to np.where().
This code works:
original_modified = original
for x, y in zip(np.unique(original), modified):
    original_modified = np.where(original == x, y, original_modified)
PS: As @PierreD pointed out, np.unique() might not be the right choice, since it sorts its results. If you need to preserve the order in which elements first appear in original, use pd.unique() instead.
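To make the two problems concrete, here is a small sketch that runs the loop from the question next to the corrected one. It assumes original is a NumPy array (with a plain list, original == x is a single boolean rather than an element-wise mask); the names `buggy` and `fixed` are only for illustration.

```python
import numpy as np

original = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
modified = [1, 3, 8]

# Loop from the question: every pass starts again from `original`, and the
# two value arguments are swapped, so only the last pass (x=3, y=8) survives.
for x, y in zip(np.unique(original), modified):
    buggy = np.where(original == x, original, y)

# Corrected loop: write y where the mask matches, otherwise keep the running result.
fixed = original.copy()
for x, y in zip(np.unique(original), modified):
    fixed = np.where(original == x, y, fixed)

print(buggy)  # [8 8 8 8 8 8 3 3 3 3]
print(fixed)  # [1 1 1 3 3 3 8 8 8 8]
```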
CodePudding user response:
Are you really constrained to using np.where? If not, an alternative solution might be:
import numpy as np
original = np.array([1,1,1,2,2,2,3,3,3,3])
modified = original.copy()
d = {2: 3, 3: 8}
for k, v in d.items():
    modified[original == k] = v
print(modified)
# [1 1 1 3 3 3 8 8 8 8]
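Note that the mask in this approach is computed on original, not on the copy being modified. The sketch below illustrates why that matters when a replacement value is itself one of the keys (here 2 becomes 3, and 3 becomes 8); the names `safe` and `chained` are illustrative only.

```python
import numpy as np

original = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
d = {2: 3, 3: 8}

safe = original.copy()
chained = original.copy()
for k, v in d.items():
    safe[original == k] = v    # mask taken from the untouched input
    chained[chained == k] = v  # mask taken from the array being rewritten

print(safe)     # [1 1 1 3 3 3 8 8 8 8]
print(chained)  # [1 1 1 8 8 8 8 8 8 8]
```

In the second variant, the elements that were just changed from 2 to 3 are picked up again by the next pass and turned into 8.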
CodePudding user response:
np.where, when called with only a condition, returns the indices where the condition is satisfied.
Indexed assignment should do this:
import numpy as np
original = np.array([1,1,1,2,2,2,3,3,3,3])
out = np.empty(original.shape, dtype=int)
modified = [1, 3, 8]
for x, y in zip(np.unique(original), modified):
    out[np.where(original == x)] = y
print(out)
# [1 1 1 3 3 3 8 8 8 8]
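If the loop itself is a concern, one possible loop-free variant (not taken from the answers above) uses the return_inverse option of np.unique. It is a sketch that assumes modified lists the replacements in the sorted order of the distinct values, which is the order np.unique returns.

```python
import numpy as np

original = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])
modified = np.array([1, 3, 8])

# inverse[i] is the position of original[i] within the sorted unique values,
# so indexing `modified` with it looks up every replacement in one step.
uniques, inverse = np.unique(original, return_inverse=True)
out = modified[inverse]

print(out)  # [1 1 1 3 3 3 8 8 8 8]
```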