How to create your own category type codes?

I have a housing dataframe that I'm working with in Python. One of the columns contains house "grade" data, which is based on an actual official county grading system. The houses are graded on a scale of 3-12, with descriptions like "good", "average", "excellent", etc. attached to them. The data in the column, however, is in string format: the numeric grade comes first, then a space, followed by the description.

So, for example, one entry might read "7 Average".

I want to first eliminate the number from the string, which I can do. I want to then convert each string entry into a category type, which I can also do. However, when I use .cat.codes to have the category codes generated automatically, is there a way to use the original county scale (3-12) as my codes? I suppose I could use one-hot encoding, but I feel that would just look messy and unprofessional.

EDIT: I just got rid of the description and kept the number and converted to int which worked. However, I still would like to see if there's an answer to my original question.

CodePudding user response：

Since your data is already inherently ordered, we can make it ordered categorical, and then rename the categories:

df.grade = pd.Categorical(df.grade, ordered=True)
df.grade = df.grade.cat.rename_categories(df.grade.cat.categories.str[2:])

Given:

df = pd.DataFrame({'grade':['6 Low Average', '5 Bad', '8 Good', '7 Average', '7 Average']})
print(df.sort_values('grade'))


# Output:
           grade
1          5 Bad
0  6 Low Average
3      7 Average
4      7 Average
2         8 Good

Doing:

df.grade = pd.Categorical(df.grade, ordered=True)
df.grade = df.grade.cat.rename_categories(df.grade.cat.categories.str[2:])
print(df.sort_values('grade'))
print(df.sort_values('grade').grade.cat.codes)

# Output:
         grade
1          Bad
0  Low Average
3      Average
4      Average
2         Good

1    0
0    1
3    2
4    2
2    3
dtype: int8

# If they weren't categorical ordered it'd look like:
#          grade
# 3      Average
# 4      Average
# 1          Bad
# 2         Good
# 0  Low Average
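
One caveat for the full 3-12 scale: pd.Categorical sorts string categories lexicographically, so an entry starting with "10" would sort before one starting with "3", and .str[2:] would leave a leading space on two-digit grades. A minimal sketch of one way around that, assuming every entry has the form "<number> <description>" (the "3 Poor" and "10 Excellent" rows below are made up for illustration, not taken from the real county scale):

```python
import pandas as pd

# Hypothetical sample data that includes a two-digit grade.
df = pd.DataFrame({'grade': ['10 Excellent', '7 Average', '3 Poor', '7 Average']})

# Split each entry into its numeric grade (column 0) and description (column 1).
parts = df['grade'].str.extract(r'^(\d+)\s+(.*)$')
parts[0] = parts[0].astype(int)

# Order the descriptions by numeric grade instead of alphabetically.
ordered_labels = parts.drop_duplicates(0).sort_values(0)[1].tolist()
df['grade'] = pd.Categorical(parts[1], categories=ordered_labels, ordered=True)

print(df.sort_values('grade'))
print(df['grade'].cat.codes)
```

The codes still start at 0, but they now follow the numeric order of the grades rather than the string order.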

CodePudding user response：

It's not possible to define the codes of a Categorical yourself. If you try to set them, one of two kinds of exceptions is raised:

>>> pd.Categorical.from_codes(...)
ValueError: codes need to be between -1 and len(categories)-1

>>> mycat.codes = [...]
AttributeError: can't set attribute

A Categorical uses -1 for missing values and the integers 0 to N-1 as positions into its N categories (which, for string values, are in lexicographical order by default).
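
A quick way to see this for yourself, as a small sketch with made-up values (one of them missing):

```python
import numpy as np
import pandas as pd

# Made-up values, including one missing entry.
cat = pd.Categorical(['Good', 'Average', np.nan, 'Good'])

print(cat.categories.tolist())  # the labels, sorted
print(cat.codes.tolist())       # positions into the labels, -1 for the missing value

# from_codes only accepts positions in the range -1 .. len(categories) - 1.
try:
    pd.Categorical.from_codes([7, 8], categories=['Average', 'Good'])
except ValueError as exc:
    print('ValueError:', exc)
```

So the codes are just positions into the list of categories, which is why arbitrary numbers such as 7 or 8 are rejected when there are only two categories.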

The best you can do is to create the category like this:

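# Split on the first space only: column 0 is the number, column 1 the description
data = df['grade'].str.split(' ', n=1, expand=True).astype({0: int})
# Order the descriptions by their numeric grade (CategoricalDtype takes no name argument)
grade = pd.CategoricalDtype(data.drop_duplicates(0).sort_values(0)[1], ordered=True)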
df['grade'] = data[1].astype(grade)

Output:

>>> grade
CategoricalDtype(categories=['Low Average', 'Average', 'Good', 'Excellent'], ordered=True)

>>> df['grade']
0    Low Average
1        Average
2           Good
3      Excellent
Name: grade, dtype: category
Categories (4, object): ['Low Average' < 'Average' < 'Good' < 'Excellent']

>>> df['grade'].cat.codes
0    0
1    1
2    2
3    3
dtype: int8
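
If you still want the county numbers next to the categories, one option is to keep a separate lookup from description to grade and map it back when needed. This is only a sketch: the sample rows and the names county_grade, grade_code and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the "<number> <description>" format.
df = pd.DataFrame({'grade': ['6 Low Average', '7 Average', '8 Good', '7 Average']})

data = df['grade'].str.split(' ', n=1, expand=True).astype({0: int})

# Lookup table: description -> county grade.
county_grade = dict(zip(data[1], data[0]))

# Categories ordered by county grade.
labels = sorted(county_grade, key=county_grade.get)
df['grade'] = pd.Categorical(data[1], categories=labels, ordered=True)

# The codes are still 0..N-1; the county number comes from the lookup.
df['grade_code'] = df['grade'].cat.codes
df['county_grade'] = df['grade'].astype(object).map(county_grade)
print(df)
```

The category codes stay an internal detail, and the county scale lives in an ordinary integer column.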

CodePudding user response：

It can be done this way:

df = pd.DataFrame({'nr_grade': ['7 Average', '8 Good', '8 Good']})
df[['nr', 'grade']] = df['nr_grade'].str.split(' ', n=1, expand=True)
df['grade'] = pd.Categorical(df.grade, categories=df.grade.unique())
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   nr_grade  3 non-null      object  
 1   nr        3 non-null      object  
 2   grade     3 non-null      category
dtypes: category(1), object(2)
memory usage: 303.0  bytes