How to map unique groups to ids without looping

Suppose I have the following NumPy structured array:

>>> z = np.zeros(3, dtype={'names': ("id", "dim1", "cnt1"), 'formats': ('i8', 'S3', 'u8')})
>>> z
array([(0, '', 0L), (0, '', 0L), (0, '', 0L)],
      dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])
>>> z["dim1"] = ["foo", "foo", "bar"]
>>> z["cnt1"] = [1,2,3]
>>> z
array([(0, 'foo', 1L), (0, 'foo', 2L), (0, 'bar', 3L)],
      dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])

I want to map each unique dim1 value to an id. One way to do it, with a for loop over the unique dim1 values, is the following:

>>> unique_groups = np.unique(z["dim1"])
>>> groups = z["dim1"]
>>> for idx, ug in enumerate(unique_groups):
...     z["id"][ug == groups] = idx
...
>>> z
array([(1, 'foo', 1L), (1, 'foo', 2L), (0, 'bar', 3L)],
      dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])

I was wondering if there is a way to do it without a for loop, using a vectorized operation instead. I tried vectorizing a lookup function, where m is a dict mapping each dim1 value to its id, as shown below:

>>> def map_column(key):
...     return m[key]
...
>>> m
{'foo': 1, 'bar': 0}
>>> f = np.vectorize(map_column, otypes=[str])
>>> f(z["dim1"])
array(['1', '1', '0'],
      dtype='|S1')

Is there a more efficient way to do it? And of the two approaches, which one performs better?

CodePudding user response:

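np.unique returns the unique values in sorted order, so you can use .searchsorted() to look up the index of each dim1 value in that array: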

In [2]: unique_groups = np.unique(z["dim1"])

In [3]: z["id"] = unique_groups.searchsorted(z["dim1"])

In [4]: z
Out[4]:
array([(1, b'foo', 1), (1, b'foo', 2), (0, b'bar', 3)],
      dtype=[('id', '<i8'), ('dim1', 'S3'), ('cnt1', '<u8')])

I haven't measured the performance, but it is probably not much better than what you already have.
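
A related option, not part of the answer above, is to ask np.unique for the indices directly with return_inverse=True, which avoids the separate lookup step. The sketch below rebuilds the array from the question, assuming Python 3 (hence the bytes literals for the 'S3' field), and is meant as an illustration rather than a benchmarked recommendation:

```python
import numpy as np

# Rebuild the structured array from the question (bytes literals for the 'S3' field).
z = np.zeros(3, dtype={'names': ("id", "dim1", "cnt1"),
                       'formats': ('i8', 'S3', 'u8')})
z["dim1"] = [b"foo", b"foo", b"bar"]
z["cnt1"] = [1, 2, 3]

# With return_inverse=True, np.unique also returns, for every element of the
# input, its index in the sorted array of unique values.
unique_groups, inverse = np.unique(z["dim1"], return_inverse=True)
z["id"] = inverse

print(z["id"])  # [1 1 0]
```

The ids match the ones produced by the loop and by searchsorted, since all three index into the same sorted array of unique values. As for np.vectorize, the NumPy documentation notes that it is provided for convenience rather than performance and is essentially a for loop, so it is unlikely to beat either vectorized variant on large arrays; timing on your own data is the only way to be sure.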