I am importing data which should be categorical from an externally sourced csv file into a pandas dataframe.
The first thing I want to do is to validate that the values are valid for the categorical type.
My strategy is to create an instance of CategoricalDtype
and then using apply
to test each value.
Question: The only way I can figure out is to test each value is in
CategoricalDtype.categories.values
but is there a "better" way? are there any methods I can use to achieve the same? I'm new to CategoricalDtype
and it doesnt feel like this is the best way to be testing the data value.
# example of what I'm doing
import pandas as pd
from pandas.api.types import CategoricalDtype
df = pd.read_csv('data.csv')
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
df['data_is_valid']=df['data_field'].apply(lambda x: x in cat.categories.values)
CodePudding user response:
If need test if exist values from column data_field
:
df['data_is_valid']=df['data_field'].isin(cat.categories)
If need test also categorical_dtype
:
from pandas.api.types import is_categorical_dtype
df['data_is_valid']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
Difference is possible see in data sample:
from pandas.api.types import CategoricalDtype
from pandas.api.types import is_categorical_dtype
df = pd.DataFrame({ "data_field": ["A", "B", "C", "D", 'E']})
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
#categories match but not Categorical
df['data_is_valid1']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match not tested Categorical
df['data_is_valid2']=df['data_field'].isin(cat.categories)
cat_type = CategoricalDtype(categories=["A", "B", "C", 'D', 'E'], ordered=True)
#created Categorical column
df['data_field'] = df['data_field'].astype(cat_type)
#categoriesand Categorical match
df['data_is_valid3']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match not tested Categorical
df['data_is_valid4']=df['data_field'].isin(cat.categories)
print (df)
data_field data_is_valid1 data_is_valid2 data_is_valid3 data_is_valid4
0 A False True True True
1 B False True True True
2 C False True True True
3 D False False False False
4 E False False False False