How to specify encoding in pandas Categoricals

Time:01-18

So I know that I can get pandas to do categorical encoding by, e.g., using df = pd.read_csv("_.csv", dtype="category"). On the resulting dataframe, I can then check df[col].cat.codes to see how the categories were encoded (in the binary case, as 0/1).

However, from the documentation it is not clear to me whether the order of the categories, i.e., which category is mapped to 0 and which to 1 etc., is predictable and / or controllable? For instance, what can I do if I would like to specify a desired encoding of, e.g., {"val1": 0, "val2": 1}?

CodePudding user response:

The codes follow the order of the categories: a value's code is the position of its category in .categories.

By default (no categories given), the categories are the unique values in sorted order (lexicographic or numeric, numbers first if the types are mixed).

If you manually define the categories, then the defined order is used.

NaN and any value that is not among the defined categories get the code -1.

Examples:

import numpy as np
import pandas as pd

s = ['B', 'A', 'A', 'C', 'D']

# automatic Categorical: letters
pd.Categorical(s).codes
# array([1, 0, 0, 2, 3], dtype=int8)

# manual Categorical: letters
pd.Categorical(s, categories=list('ABCDE')).codes
# array([1, 0, 0, 2, 3], dtype=int8)

# manual Categorical: custom order
pd.Categorical(s, categories=list('CDEAB')).codes
# array([4, 3, 3, 0, 1], dtype=int8)

# automatic Categorical: mixed types and NaN
pd.Categorical([9, 'B', 'A', 0, np.nan, 1]).codes
# array([ 2,  4,  3,  0, -1,  1], dtype=int8)

# manual Categorical: missing values
pd.Categorical(['B', 'A', 'A', 'C', 'D'], categories=list('CB')).codes
# array([ 1, -1, -1,  0, -1], dtype=int8)
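To check which code goes with which category, one option is to pair each position in `.categories` with its label. A minimal sketch; the names `code_to_label` and `label_to_code` are illustrative only:

```python
import pandas as pd

cat = pd.Categorical(['B', 'A', 'A', 'C', 'D'], categories=list('CDEAB'))

# The code of a category is its position in cat.categories.
code_to_label = dict(enumerate(cat.categories))
label_to_code = {label: code for code, label in code_to_label.items()}

print(label_to_code)
# {'C': 0, 'D': 1, 'E': 2, 'A': 3, 'B': 4}
```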

NB. The codes themselves cannot be set to arbitrary values: they are always -1 (NaN) or 0 to N-1 for N categories.
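For the encoding asked about in the question, {"val1": 0, "val2": 1}, one possible approach is to pass an explicit `pd.CategoricalDtype` to `read_csv`. This is only a sketch: the column name `col` and the inline CSV text are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "col\nval2\nval1\nval2\n"

# The position in `categories` becomes the code: val1 -> 0, val2 -> 1.
dtype = pd.CategoricalDtype(categories=["val1", "val2"])
df = pd.read_csv(io.StringIO(csv_text), dtype={"col": dtype})

codes = df["col"].cat.codes.tolist()
print(codes)
# [1, 0, 1]
```

For a column that is already categorical, df["col"].cat.set_categories(["val1", "val2"]) should give the same category order.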
