I have a housing dataframe that I'm working with in Python. One of the columns contains house "grade" data, which is based on an official county grading system. The houses are graded on a scale of 3 to 12, with descriptions like "good", "average", "excellent", etc. attached to them. The data in the column, however, is in string format: the number grade comes first, then a space, followed by the description.
So for example one entry might read as "7 Average".
I want to first eliminate the number from the string, which I can do. I want to then convert each string entry into a category type, which I can also do. However, when I use .cat.codes to have the category codes generated automatically, is there a way to use the original county scale (3-12) as my codes? I suppose I could use one-hot encoding, but I feel that will just look messy and unprofessional.
EDIT: I just got rid of the description and kept the number and converted to int which worked. However, I still would like to see if there's an answer to my original question.
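For reference, the workaround described in the edit might look roughly like this. It is only a sketch: the sample values and the `grade_num` column name are assumptions for illustration, not the real county data.

```python
import pandas as pd

# Hypothetical sample; the real column holds values such as "7 Average".
df = pd.DataFrame({'grade': ['7 Average', '10 Very Good', '5 Fair']})

# Pull out the leading digits and convert them to integers.
df['grade_num'] = df['grade'].str.extract(r'^(\d+)', expand=False).astype(int)

print(df['grade_num'].tolist())
```

Using a regular expression for the leading digits keeps this working for two-digit grades such as 10, 11 and 12.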
CodePudding user response:
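Since your data is already inherently ordered, you can make it an ordered categorical and then rename the categories. Note that this relies on string sort order and on .str[2:], so as written it only works while every grade is a single digit: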
df.grade = pd.Categorical(df.grade, ordered=True)
df.grade = df.grade.cat.rename_categories(df.grade.cat.categories.str[2:])
Given:
import pandas as pd
df = pd.DataFrame({'grade':['6 Low Average', '5 Bad', '8 Good', '7 Average', '7 Average']})
print(df.sort_values('grade'))
# Output:
           grade
1          5 Bad
0  6 Low Average
3      7 Average
4      7 Average
2         8 Good
Doing:
df.grade = pd.Categorical(df.grade, ordered=True)
df.grade = df.grade.cat.rename_categories(df.grade.cat.categories.str[2:])
print(df.sort_values('grade'))
print(df.sort_values('grade').grade.cat.codes)
# Output:
         grade
1          Bad
0  Low Average
3      Average
4      Average
2         Good
1    0
0    1
3    2
4    2
2    3
dtype: int8
# Without the ordered categorical, the stripped labels would sort alphabetically:
#          grade
# 3      Average
# 4      Average
# 1          Bad
# 2         Good
# 0  Low Average
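The county scale runs up to 12, and a string like "10 Very Good" sorts before "5 Bad", so the category order needs to come from the numeric part when two-digit grades are present. A minimal sketch of one way to do that, assuming hypothetical sample labels:

```python
import pandas as pd

# Hypothetical sample that includes a two-digit grade.
df = pd.DataFrame({'grade': ['10 Very Good', '5 Bad', '8 Good', '7 Average']})

# Split once into the numeric grade (column 0) and the description (column 1).
parts = df['grade'].str.split(' ', n=1, expand=True)
parts[0] = parts[0].astype(int)

# Order the descriptions by their numeric grade, not alphabetically.
ordered_labels = parts.drop_duplicates(0).sort_values(0)[1].tolist()
df['grade'] = pd.Categorical(parts[1], categories=ordered_labels, ordered=True)

print(df.sort_values('grade'))
print(df['grade'].cat.codes.tolist())
```

The codes still start at 0, but they now follow the county ordering rather than the alphabet.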
CodePudding user response:
It's not possible to choose the codes of a Categorical dtype yourself. Trying to set them raises one of two exceptions:
>>> pd.Categorical.from_codes(...)
ValueError: codes need to be between -1 and len(categories)-1
>>> mycat.codes = [...]
AttributeError: can't set attribute
Categorical dtype uses -1 for missing values and 0 to N-1 to index the N categories (by default in lexicographical order for string values).
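A small illustration of that coding scheme, using made-up labels and one missing value:

```python
import pandas as pd

# Hypothetical labels; None stands in for a missing grade.
s = pd.Series(['Good', 'Average', None, 'Bad'], dtype='category')

# Categories are inferred and sorted alphabetically by default.
print(list(s.cat.categories))

# Each value is stored as the position of its category; missing becomes -1.
print(s.cat.codes.tolist())
```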
The best you can do is to create the category like this:
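# Split once into the numeric grade (column 0) and the description (column 1)
data = df['grade'].str.split(' ', n=1, expand=True).astype({0: int})
# Order the descriptions by numeric grade and use them as ordered categories
grade = pd.CategoricalDtype(data.drop_duplicates(0).sort_values(0)[1], ordered=True)
df['grade'] = data[1].astype(grade)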
Output:
>>> grade
CategoricalDtype(categories=['Low Average', 'Average', 'Good', 'Excellent'], ordered=True)
>>> df['grade']
0    Low Average
1        Average
2           Good
3      Excellent
Name: grade, dtype: category
Categories (4, object): ['Low Average' < 'Average' < 'Good' < 'Excellent']
>>> df['grade'].cat.codes
0    0
1    1
2    2
3    3
dtype: int8
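Since the codes themselves cannot be changed, one option is to keep a separate lookup from description to county grade and apply it when the 3-12 number is needed. This is a sketch under assumptions: the sample data and the name `label_to_grade` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data in the "<number> <description>" format.
raw = pd.Series(['6 Low Average', '10 Very Good', '7 Average', '7 Average'])
parts = raw.str.split(' ', n=1, expand=True)

# Lookup from description to the county's numeric grade.
label_to_grade = dict(zip(parts[1], parts[0].astype(int)))

# Ordered categorical whose category order follows the numeric grade.
grade = pd.Categorical(parts[1],
                       categories=sorted(label_to_grade, key=label_to_grade.get),
                       ordered=True)

# The codes stay 0..N-1; the lookup recovers the county scale.
county_grade = [label_to_grade[label] for label in grade]
print(grade.codes.tolist())
print(county_grade)
```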
CodePudding user response:
You can split the column into the number and the description, then make the description categorical:
df = pd.DataFrame({'nr_grade': ['7 Average', '8 Good', '8 Good']})
df[['nr', 'grade']] = df['nr_grade'].str.split(' ', n=1, expand=True)
df['grade'] = pd.Categorical(df.grade, categories=df.grade.unique())
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   nr_grade  3 non-null      object
 1   nr        3 non-null      object
 2   grade     3 non-null      category
dtypes: category(1), object(2)
memory usage: 303.0 bytes
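Note that with `categories=df.grade.unique()` the codes follow the order in which the descriptions first appear, not the county scale. A quick check, reordering the sample rows as an assumption to make this visible:

```python
import pandas as pd

# Same idea as above, but with "8 Good" appearing before "7 Average".
df = pd.DataFrame({'nr_grade': ['8 Good', '7 Average', '8 Good']})
df[['nr', 'grade']] = df['nr_grade'].str.split(' ', n=1, expand=True)
df['grade'] = pd.Categorical(df['grade'], categories=df['grade'].unique())

# Codes reflect first appearance: Good -> 0, Average -> 1.
print(df['grade'].cat.codes.tolist())

# The county number is still available from the 'nr' column.
df['nr'] = df['nr'].astype(int)
print(df['nr'].tolist())
```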