How to select ordered categorical columns from a pandas dataframe?

Time:05-05

I have a pandas data frame with both unordered and ordered categorical columns (as well as columns with other data types). I want to select only the ordered categorical columns.

Here's an example dataset:

import pandas as pd
import numpy.random as npr

n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)

Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.

I can select all categorical columns with .select_dtypes().

categories = people.select_dtypes("category")

I could use a list comprehension with the .cat.ordered property to then limit this to only ordered categories.

categories[[col for col in categories.columns if categories[col].cat.ordered]]

This is dreadfully complicated code, so it feels like there must be a better way.

What's the idiomatic way of selecting only ordered columns from a dataframe?

CodePudding user response:

You can iterate directly over the dtypes to build a boolean mask, which avoids copying the underlying data until you are ready to subset:

>>> categorical_ordered = [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]

>>> people.loc[:, categorical_ordered].head()
   age_group
0   [30, 40)
1   [20, 30)
2   [50, 60)
3   [30, 40)
4   [20, 30)
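
If you need this in more than one place, the mask can be wrapped in a small helper. This is only a sketch: ordered_categorical_columns is a made-up name (not a pandas function), and the seeded data simply mirrors the example from the question.

```python
import numpy as np
import pandas as pd


def ordered_categorical_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only columns whose dtype is an ordered categorical."""
    # One boolean per column; only the dtypes are inspected, not the data.
    mask = [isinstance(d, pd.CategoricalDtype) and d.ordered for d in df.dtypes]
    return df.loc[:, mask]


# Rebuild a dataset shaped like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": pd.Categorical(rng.choice(eye_colors, size=20), eye_colors),
    "age": rng.integers(20, 60, size=20),
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

selected = ordered_categorical_columns(people)
print(list(selected.columns))  # ['age_group']
```

Returning a DataFrame rather than the mask keeps the call site short; return the mask instead if you also want the complement (the unordered or non-categorical columns).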

You can also use is_categorical_dtype, as recommended by @richardec in the comments (note that it is deprecated in pandas 2.x in favor of the isinstance check), or simply compare the dtype with the string 'category'.

>>> from pandas.api.types import is_categorical_dtype
>>> [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
[False, False, True]

>>> [is_categorical_dtype(d) and d.ordered for d in people.dtypes]
[False, False, True]

>>> [d == 'category' and d.ordered for d in people.dtypes]
[False, False, True]

You can also abstract away the for-loop by using .apply:

>>> people.dtypes.apply(lambda d: d == 'category' and d.ordered)
eye_color    False
age          False
age_group     True
dtype: bool

>>> people.loc[:, people.dtypes.apply(lambda d: d == 'category' and d.ordered)]
   age_group
0   [20, 30)
1   [40, 50)
2   [20, 30)
3   [40, 50)
...

CodePudding user response:

One option is with getattr; I'd pick a list comprehension over this though:

people.loc[:, people.apply(getattr, args=('cat',None))
                    .apply(getattr, args=('ordered', False))]

   age_group
0   [40, 50)
1   [50, 60)
2   [30, 40)
3   [40, 50)
4   [30, 40)
5   [40, 50)
6   [40, 50)
7   [20, 30)
8   [20, 30)
9   [20, 30)
10  [40, 50)
11  [20, 30)
12  [50, 60)
13  [40, 50)
14  [40, 50)
15  [20, 30)
16  [50, 60)
17  [30, 40)
18  [50, 60)
19  [40, 50)
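
To see why the chained getattr calls work, it may help to split them into two steps. The sketch below uses a tiny hand-written frame (the three ages are made up for illustration) instead of the random data above.

```python
import pandas as pd

people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], ["blue", "brown"]),
    "age": [25, 41, 58],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

# Step 1: getattr(column, "cat", None) returns the .cat accessor for categorical
# columns. For other columns, accessing .cat raises AttributeError, which
# getattr turns into the default value None.
accessors = people.apply(getattr, args=("cat", None))

# Step 2: getattr(x, "ordered", False) reads .ordered from each accessor and
# falls back to False for the None placeholders.
mask = accessors.apply(getattr, args=("ordered", False))

print(mask.tolist())  # [False, False, True]
print(list(people.loc[:, mask].columns))  # ['age_group']
```

The intermediate accessors Series holds Python objects, which is part of why a list comprehension over people.dtypes is usually the simpler choice.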