I have a pandas data frame with both unordered and ordered categorical columns (as well as columns with other data types). I want to select only the ordered categorical columns.
Here's an example dataset:
import pandas as pd
import numpy.random as npr
n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)
Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.
I can select all categorical columns with .select_dtypes().
categories = people.select_dtypes("category")
I could then use a list comprehension with the .cat.ordered property to limit this to only ordered categories.
categories[[col for col in categories.columns if categories[col].cat.ordered]]
This is dreadfully complicated code, so it feels like there must be a better way.
What's the idiomatic way of selecting only ordered columns from a dataframe?
CodePudding user response:
You can iterate directly over the dtypes and build a boolean mask, which avoids copying the underlying data until you are ready to subset:
>>> categorical_ordered = [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
>>> people.loc[:, categorical_ordered].head()
age_group
0 [30, 40)
1 [20, 30)
2 [50, 60)
3 [30, 40)
4 [20, 30)
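If you need this in more than one place, the mask above can be wrapped in a small helper. This is only a sketch: the function name select_ordered_categoricals and the demo frame are made up for illustration and are not part of pandas.

```python
import pandas as pd


def select_ordered_categoricals(df):
    """Return only the columns of df whose dtype is an ordered categorical."""
    # Build the mask from the dtypes alone; no column data is touched here.
    mask = [isinstance(d, pd.CategoricalDtype) and d.ordered for d in df.dtypes]
    return df.loc[:, mask]


# Small deterministic frame with the same three kinds of columns as `people`.
demo = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], ["blue", "brown"]),
    "age": [25, 37, 52],
})
demo["age_group"] = pd.cut(demo["age"], [20, 30, 40, 50, 60], right=False)

result = select_ordered_categoricals(demo)
print(list(result.columns))  # ['age_group']
```

Calling select_ordered_categoricals(people) should give the same age_group column as the .loc call above.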
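You can also use is_categorical_dtype, as recommended by @richardec in the comments, or simply compare the dtype against the string 'category'. Note that is_categorical_dtype has been deprecated since pandas 2.1 in favor of the isinstance check, so it emits a warning on recent versions.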
>>> from pandas.api.types import is_categorical_dtype
>>> [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
[False, False, True]
>>> [is_categorical_dtype(d) and d.ordered for d in people.dtypes]
[False, False, True]
>>> [d == 'category' and d.ordered for d in people.dtypes]
[False, False, True]
You can also abstract away the for-loop by using .apply:
>>> people.dtypes.apply(lambda d: d == 'category' and d.ordered)
eye_color False
age False
age_group True
dtype: bool
>>> people.loc[:, people.dtypes.apply(lambda d: d == 'category' and d.ordered)]
age_group
0 [20, 30)
1 [40, 50)
2 [20, 30)
3 [40, 50)
...
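Because .apply on the dtypes returns a boolean Series indexed by column name, you can also pull out just the names of the ordered columns without subsetting the frame. A minimal sketch, using a small made-up frame (demo) and the isinstance check rather than the string comparison:

```python
import pandas as pd

# Small deterministic frame; `demo` is a stand-in for `people`.
demo = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], ["blue", "brown"]),
    "age": [25, 37, 52],
})
demo["age_group"] = pd.cut(demo["age"], [20, 30, 40, 50, 60], right=False)

# Boolean Series: index is the column names, values say "ordered categorical?"
mask = demo.dtypes.apply(lambda d: isinstance(d, pd.CategoricalDtype) and d.ordered)

# Keep only the True entries and read the column names off the index.
ordered_names = mask[mask].index.tolist()
print(ordered_names)  # ['age_group']
```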
CodePudding user response:
One option is with getattr; I'd pick a list comprehension over this, though:
people.loc[:, people.apply(getattr, args=('cat',None))
.apply(getattr, args=('ordered', False))]
age_group
0 [40, 50)
1 [50, 60)
2 [30, 40)
3 [40, 50)
4 [30, 40)
5 [40, 50)
6 [40, 50)
7 [20, 30)
8 [20, 30)
9 [20, 30)
10 [40, 50)
11 [20, 30)
12 [50, 60)
13 [40, 50)
14 [40, 50)
15 [20, 30)
16 [50, 60)
17 [30, 40)
18 [50, 60)
19 [40, 50)
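For comparison, here is roughly what the same getattr idea looks like as a list comprehension. It relies on the .cat accessor raising AttributeError for non-categorical columns, so the inner getattr falls back to None and the outer one falls back to False. The demo frame is made up for illustration.

```python
import pandas as pd

# Small deterministic frame; `demo` is a stand-in for `people`.
demo = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], ["blue", "brown"]),
    "age": [25, 37, 52],
})
demo["age_group"] = pd.cut(demo["age"], [20, 30, 40, 50, 60], right=False)

# Inner getattr: the .cat accessor, or None for non-categorical columns.
# Outer getattr: the accessor's .ordered flag, or False when there is no accessor.
mask = [
    getattr(getattr(demo[col], "cat", None), "ordered", False)
    for col in demo.columns
]
print(mask)  # [False, False, True]

ordered_only = demo.loc[:, mask]
```

Unlike the dtype-based masks, this touches each column as a Series, so it is more of a curiosity than a recommendation.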