I have a dataset with a categorical variable that is not cleanly coded. The same category appears sometimes in upper case, sometimes in lower case, and in several mixed-case variations. Since the dataset is large, I would like to harmonize the categories while taking advantage of the categorical dtype, and therefore want to avoid any replace-based solution.
The only solutions I found are this and this, but I feel they implicitly make use of replace.
I report a toy example below, along with the solutions I tried:
from pandas import Series
# Create dataset
df = Series(["male", "female","Male", "FEMALE", "MALE", "MAle"], dtype="category", name = "NEW_TEST")
# Define the old, the "new" and the desired categories
original_categories = list(df.cat.categories)
standardised_categories = list(map(lambda x: x.lower(), df.cat.categories))
desired_new_cat = list(set(standardised_categories))
# Failed attempt to change categories
df.cat.categories = standardised_categories
df = df.cat.rename_categories(standardised_categories)
# Error message: Categorical categories must be unique
CodePudding user response:
You shouldn't try to harmonize after converting to category. Doing so defeats the purpose of a categorical, since one category is created per exact string.
You can instead harmonize the case with str.capitalize, then convert to categorical:
import pandas as pd

s = (pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
               name="NEW_TEST")
       .str.capitalize().astype('category')
    )
If you already have a category, convert back to string and start over:
s = s.astype(str).str.capitalize().astype('category')
Output:
0 Male
1 Female
2 Male
3 Female
4 Male
5 Male
Name: NEW_TEST, dtype: category
Categories (2, object): ['Female', 'Male']
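If the round trip through strings is too costly on a large Series, one possible alternative (a sketch, not part of the original answer) is to normalise only the category labels and remap the integer codes. The variable names `old_to_new`, `new_codes`, and `harmonized` are illustrative; the approach assumes the categories are all strings.

```python
import numpy as np
import pandas as pd

s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# Normalise only the (few) category labels, not every row.
lowered = s.cat.categories.str.lower()

# For each old category position, find its position in the
# de-duplicated, sorted set of normalised labels.
old_to_new, new_categories = pd.factorize(lowered, sort=True)

# Remap the integer codes; -1 marks a missing value and must stay -1.
old_codes = s.cat.codes.to_numpy()
new_codes = np.where(old_codes == -1, -1, old_to_new[old_codes])

harmonized = pd.Series(
    pd.Categorical.from_codes(new_codes, categories=new_categories),
    index=s.index,
    name=s.name,
)
print(harmonized)
```

The string operation here runs once per category rather than once per row, which is where the saving would come from when there are few categories and many rows.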
CodePudding user response:
Given the Series df that OP creates in the code sample shared in the question, one approach would be to use pandas.Series.str.lower together with .astype("category"), as follows:
df = df.str.lower().astype("category")
[Out]:
0 male
1 female
2 male
3 female
4 male
5 male
If one prints the dtype, one gets
CategoricalDtype(categories=['female', 'male'], ordered=False)
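To check the result, a minimal sketch along these lines rebuilds the Series from the question, applies the conversion, and inspects the dtype, the categories, and the underlying integer codes:

```python
import pandas as pd

df = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
               dtype="category", name="NEW_TEST")

# .str.lower() yields a plain string Series, so convert back to category.
df = df.str.lower().astype("category")

print(df.dtype)                 # the categorical dtype with its categories
print(list(df.cat.categories))  # the harmonized category labels
print(df.cat.codes.tolist())    # integer code stored for each row
```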