Home > Back-end >  Pandas series - how to validate each item is categorical
Pandas series - how to validate each item is categorical

Time:10-12

I am importing data which should be categorical from an externally sourced csv file into a pandas dataframe.

The first thing I want to do is to validate that the values are valid for the categorical type.

My strategy is to create an instance of CategoricalDtype and then using apply to test each value.

Question: The only way I can figure out is to test each value is in CategoricalDtype.categories.values but is there a "better" way? are there any methods I can use to achieve the same? I'm new to CategoricalDtype and it doesnt feel like this is the best way to be testing the data value.

# example of what I'm doing
import pandas as pd
from pandas.api.types import CategoricalDtype


df = pd.read_csv('data.csv')
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)

df['data_is_valid']=df['data_field'].apply(lambda x: x in cat.categories.values)

CodePudding user response:

If need test if exist values from column data_field :

df['data_is_valid']=df['data_field'].isin(cat.categories)

If need test also categorical_dtype:

from pandas.api.types import is_categorical_dtype

df['data_is_valid']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])

Difference is possible see in data sample:

from pandas.api.types import CategoricalDtype
from pandas.api.types import is_categorical_dtype
            
df = pd.DataFrame({ "data_field": ["A", "B", "C", "D", 'E']})

cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)

#categories match but not Categorical
df['data_is_valid1']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match not tested Categorical
df['data_is_valid2']=df['data_field'].isin(cat.categories)

cat_type = CategoricalDtype(categories=["A", "B", "C", 'D', 'E'], ordered=True)

#created Categorical column
df['data_field'] = df['data_field'].astype(cat_type)

#categoriesand Categorical match
df['data_is_valid3']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match not tested Categorical
df['data_is_valid4']=df['data_field'].isin(cat.categories)

print (df)
  data_field  data_is_valid1  data_is_valid2  data_is_valid3  data_is_valid4
0          A           False            True            True            True
1          B           False            True            True            True
2          C           False            True            True            True
3          D           False           False           False           False
4          E           False           False           False           False
  • Related