Redefine categories of a categorical variable ignoring upper and lower case

I have a dataset with a categorical variable that is not cleanly coded. The same category appears sometimes in upper case, sometimes in lower case, and in several mixed-case variations. Since the dataset is large, I would like to harmonize the categories while taking advantage of the categorical dtype, so I want to rule out any replace-based solution. The only solutions I found are this and this, but I feel they implicitly make use of replace.

Below is a toy example, along with the solutions I tried.

from pandas import Series

# Create dataset
df = Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"], dtype="category", name="NEW_TEST")

# Define the old, the "new" and the desired categories
original_categories = list(df.cat.categories)
standardised_categories = list(map(lambda x: x.lower(), df.cat.categories)) 
desired_new_cat = list(set(standardised_categories))

# Failed attempt to change categories   
df.cat.categories = standardised_categories
df = df.cat.rename_categories(standardised_categories)
# Error message: Categorical categories must be unique

CodePudding user response：

You shouldn't try to harmonize after converting to category. Doing so defeats the purpose of a categorical, because a separate category is created for each exact string.
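As a small illustration of that point, the sketch below rebuilds the toy Series from the question and counts its categories, then shows why a plain rename cannot merge them. The variable names (`n_categories`, `rename_error`) are made up for this example.

```python
import pandas as pd

s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# Each distinct spelling is its own category, so nothing has been merged yet.
n_categories = len(s.cat.categories)
print(n_categories)

# Renaming to lowercase would produce duplicate labels, which pandas rejects.
lowered = [c.lower() for c in s.cat.categories]
try:
    s.cat.rename_categories(lowered)
    rename_error = None
except ValueError as err:
    rename_error = str(err)
print(rename_error)
```

Renaming only relabels existing categories one-to-one, so it cannot collapse several spellings into a single category.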

You can instead harmonize the case with str.capitalize, then convert to categorical:

import pandas as pd

s = (pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
               name="NEW_TEST")
       .str.capitalize().astype('category')
     )

If you already have a category, convert back to string and start over:

s = s.astype(str).str.capitalize().astype('category')

Output:

0      Male
1    Female
2      Male
3    Female
4      Male
5      Male
Name: NEW_TEST, dtype: category
Categories (2, object): ['Female', 'Male']
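If the round trip through strings feels wasteful on a large Series, one possible alternative is to lowercase only the category labels and then re-point the existing integer codes at the merged labels. This is a rough sketch rather than an established recipe: it relies on `pd.factorize` and `pd.Categorical.from_codes`, and the names `old_to_new`, `new_codes` and `harmonized` are invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# Lowercase only the (few) category labels, not every row.
lowered = s.cat.categories.str.lower()

# For each old category, find the position of its merged lowercase label.
old_to_new, new_categories = pd.factorize(lowered, sort=True)

# Re-point each row's integer code at the merged category.
# A code of -1 marks a missing value and is kept as -1.
old_codes = s.cat.codes.to_numpy()
new_codes = np.where(old_codes == -1, -1, old_to_new[old_codes])

harmonized = pd.Series(
    pd.Categorical.from_codes(new_codes, categories=new_categories),
    index=s.index,
    name=s.name,
)
print(harmonized)
```

The string work here scales with the number of categories rather than the number of rows; whether that matters in practice depends on the data, so it is worth timing against the simpler `astype(str)` route.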

CodePudding user response：

Given the Series df that OP creates in the code sample shared in the question, one approach would be to use pandas.Series.str.lower followed by .astype("category"), as follows

df = df.str.lower().astype("category")

[Out]:

0      male
1    female
2      male
3    female
4      male
5      male

If one prints the dtype, one gets

CategoricalDtype(categories=['female', 'male'], ordered=False)
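To see why the `.astype("category")` step is needed, the following sketch splits the one-liner in two. As far as I can tell, the `.str` accessor works on a categorical Series with string categories but hands back a non-categorical result, so the dtype has to be restored afterwards. The names `lowered` and `result` are only for this example.

```python
import pandas as pd

s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# The string method runs on the categorical Series directly...
lowered = s.str.lower()

# ...but the result is no longer categorical.
is_categorical_before = isinstance(lowered.dtype, pd.CategoricalDtype)
print(is_categorical_before)

# Converting back rebuilds the categories from the harmonized strings.
result = lowered.astype("category")
print(result.dtype)
```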