Why does np.vectorize work here when np.where throws a TypeError?

Time:12-17

I have a pandas DataFrame where a column called myenum has values of either 0, 1, or 2. I am trying to translate 1s and 2s to strings and use an Enum's .name attribute to help.

I think this is a question about understanding the guts of np.where vs np.vectorize as they relate to DataFrame Series. I am curious why the attempt throws an error using np.where, yet works using np.vectorize. I would like to learn from this and better understand best vectorization practices in DataFrames.

import enum
import numpy as np
import pandas as pd

df = pd.DataFrame() # one column in this df is 'myenum', its values are either 0, 1, or 2
df['myenum'] = [0, 1, 2, 0, 0, 0, 2, 1, 0]


class MyEnum(enum.Enum):
    First = 1
    Second = 2

# this throws a TypeError - why?
df['myenum'] = np.where(
    df['myenum'] > 0,
    MyEnum(df['myenum']).name,
    ''
    )

# whereas this, which seems pretty analogous, works.  what am i missing?
def vectorize_enum_value(x):
    if x > 0:
        return  MyEnum(x).name
    return ''
vect = np.vectorize(vectorize_enum_value)
df['myenum'] = vect(df['myenum'])

CodePudding user response:

The full traceback from your where expression is:

Traceback (most recent call last):
  File "/usr/lib/python3.8/enum.py", line 641, in __new__
    return cls._value2member_map_[value]
TypeError: unhashable type: 'Series'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<ipython-input-27-16f5edc71240>", line 3, in <module>
    MyEnum(df['myenum']).name,
  File "/usr/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/usr/lib/python3.8/enum.py", line 648, in __new__
    if member._value_ == value:
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

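Both exceptions come from giving the whole Series to MyEnum: the lookup by value fails because a Series is unhashable, and the fallback comparison then fails because a Series has no single truth value. MyEnum on its own reproduces it: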

In [30]: MyEnum(df['myenum'])
Traceback (most recent call last):
  File "/usr/lib/python3.8/enum.py", line 641, in __new__
    return cls._value2member_map_[value]
TypeError: unhashable type: 'Series'
...

The problem isn't with np.where at all.

np.where works fine if we give it a valid list of strings:

In [33]: np.where(
    ...:     df['myenum'] > 0,
    ...:     [vectorize_enum_value(x) for x in df['myenum']],
    ...:     ''
    ...:     )
Out[33]: 
array(['', 'First', 'Second', '', '', '', 'Second', 'First', ''],
      dtype='<U6')

That second argument, the list comprehension, does basically the same thing as np.vectorize: it calls the function once per element.
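As a rough check of that equivalence, the sketch below runs both forms on the sample data from the question and compares the results. The names s, via_vectorize and via_listcomp are made up for this illustration.

```python
import enum

import numpy as np
import pandas as pd


class MyEnum(enum.Enum):
    First = 1
    Second = 2


def vectorize_enum_value(x):
    # Called with one scalar at a time, so MyEnum(x) gets a plain value.
    if x > 0:
        return MyEnum(x).name
    return ''


# Same sample values as the question's 'myenum' column.
s = pd.Series([0, 1, 2, 0, 0, 0, 2, 1, 0])

# np.vectorize loops over the elements and calls the function on each one.
via_vectorize = np.vectorize(vectorize_enum_value)(s)

# The list comprehension spells out the same per-element loop.
via_listcomp = [vectorize_enum_value(x) for x in s]

print(via_vectorize.tolist() == via_listcomp)  # True
```

Neither form is vectorized in the NumPy sense; both are Python-level loops, which is why the scalar-only MyEnum(x) call works inside them.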

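np.where is an ordinary function, and Python evaluates a function's arguments before calling it, so each argument has to be a valid expression on its own. MyEnum(df['myenum']) is evaluated first, and fails, before np.where ever runs. Unlike apply or np.vectorize, np.where does not loop over the elements and call your code on each one.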
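A small sketch of that evaluation order: spy_where is a hypothetical wrapper, added here only to record whether np.where is ever reached. The exact exception type can differ between Python versions, so the sketch catches both. The last lines show one possible element-wise alternative using Series.map with a value-to-name dict; that is a suggestion for this example, not something taken from the question.

```python
import enum

import numpy as np
import pandas as pd


class MyEnum(enum.Enum):
    First = 1
    Second = 2


s = pd.Series([0, 1, 2, 0, 0, 0, 2, 1, 0])

calls = []  # records every call that actually reaches the wrapper


def spy_where(*args):
    # Hypothetical helper: logs the call, then defers to np.where.
    calls.append(args)
    return np.where(*args)


error_type = None
try:
    # MyEnum(s) is evaluated while building the argument list,
    # so it raises before spy_where (and np.where) is entered.
    spy_where(s > 0, MyEnum(s).name, '')
except (TypeError, ValueError) as exc:
    error_type = type(exc).__name__

print(error_type)   # the failure comes from MyEnum, not np.where
print(len(calls))   # 0: the wrapper was never called

# One possible alternative: map each value to its name via a dict.
# Values with no enum member (here 0) become NaN, then ''.
value_to_name = {member.value: member.name for member in MyEnum}
names = s.map(value_to_name).fillna('')
print(names.tolist())
```

The map version sidesteps the problem because the enum is only consulted while building the small lookup dict, one member at a time, never with the whole Series.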
