Suppose I have the following NumPy structured array:
>>> z = np.zeros(3, dtype={'names': ("id", "dim1", "cnt1"), 'formats': ('i8', 'S3', 'u8')})
>>> z
array([(0, '', 0L), (0, '', 0L), (0, '', 0L)],
dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])
>>> z["dim1"] = ["foo", "foo", "bar"]
>>> z["cnt1"] = [1,2,3]
>>> z
array([(0, 'foo', 1L), (0, 'foo', 2L), (0, 'bar', 3L)],
dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])
I want to map each unique dim1 value to an id.
One way to do it with a for loop on unique dim1 values is the following:
>>> unique_groups = np.unique(z["dim1"])
>>> groups = z["dim1"]
>>> for idx, ug in enumerate(unique_groups):
...     z["id"][ug == groups] = idx
...
>>> z
array([(1, 'foo', 1L), (1, 'foo', 2L), (0, 'bar', 3L)],
dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])
I was wondering if there is a way to do this without a for loop, using a vectorized operation instead. I tried vectorizing a lookup function, as shown below:
>>> def map_column(key):
...     return m[key]
...
>>> m
{'foo': 1, 'bar': 0}
>>> f = np.vectorize(map_column, otypes=[str])
>>> f(z["dim1"])
array(['1', '1', '0'],
dtype='|S1')
Is there a more efficient way to do this? And of the two approaches above, which is considered better performance-wise?
CodePudding user response:
You can use .searchsorted():
In [2]: unique_groups = np.unique(z["dim1"])
In [3]: z["id"] = unique_groups.searchsorted(z["dim1"])
In [4]: z
Out[4]:
array([(1, b'foo', 1), (1, b'foo', 2), (0, b'bar', 3)],
dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])
Not sure about performance, but probably not much better.
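For reference, here is a minimal self-contained sketch of the approach above, next to an alternative that is not from the original answer: np.unique with return_inverse=True, which returns the index of each element in the sorted unique array in a single call. The variable names ids_searchsorted and ids_inverse are only illustrative, and the field values are written as bytes literals here to match the 'S3' dtype under Python 3.

```python
import numpy as np

# Rebuild the structured array from the question.
z = np.zeros(3, dtype={'names': ("id", "dim1", "cnt1"),
                       'formats': ('i8', 'S3', 'u8')})
z["dim1"] = [b"foo", b"foo", b"bar"]  # bytes literals, since the field is 'S3'
z["cnt1"] = [1, 2, 3]

# Approach from the answer: np.unique returns sorted values, so
# searchsorted finds each element's position in that sorted array.
unique_groups = np.unique(z["dim1"])
ids_searchsorted = unique_groups.searchsorted(z["dim1"])

# Alternative (an assumption, not part of the answer): ask np.unique
# for the inverse indices directly.
_, ids_inverse = np.unique(z["dim1"], return_inverse=True)

z["id"] = ids_inverse
print(z["id"])
```

Both variants avoid the Python-level loop. Which one is faster likely depends on the array size and the number of distinct values, so it is worth timing them on real data.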