I have a pandas data frame with a factor column with 30 different levels. Some of the levels occur infrequently so I convert them to an 'Other' grouping. The resulting column has 25 different levels plus 1 'other' level.
d = df1['column1'].value_counts() >= 50
df1['column1'] = [i if d[i] else 'Other' for i in df1['column1']]
df1['column1'] = df1['column1'].astype('category')
I have a second data frame that I'd like to convert to have the same levels as the first data frame (including any new levels that don't occur in the first data frame). I've tried the code below but I get a 'Key Error' but it doesn't really explain the issue.
df2['column1'] = [i if d[i] else 'Other' for i in df2['column1']]
df2['column1'] = df2['column1'].astype('category')
Any idea what's causing this?
CodePudding user response:
I was able to reproduce your Key Error
with your code by injecting values into df2['column1']
that don't exist in df1['column1']
.
You could make that process resilient by doing:
df1 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(10, size=100)]})
df2 containing additional values:
df2 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(12, size=100)]})
Get the most frequent levels and translate:
cat_counts = df1['column1'].value_counts()
df1.assign(column1=np.where(df1['column1'].isin(cat_counts[cat_counts > 10].index), df1['column1'], 'other')).astype({'column1': 'category'})
column1
0 L4
1 L9
2 L9
3 other
4 other
.. ...
95 other
96 other
97 other
98 L3
99 other
That same construct also works for df2 even though it contains values not present in df1:
df2.assign(column1=np.where(df2['column1'].isin(cat_counts[cat_counts > 10].index), df2['column1'], 'other')).astype({'column1': 'category'})
column1
0 other
1 L9
2 other
3 other
4 other
.. ...
95 other
96 other
97 other
98 L9
99 other
Another option would be to select the n most frequent levels:
df1['column1'].isin(cat_counts.nlargest(5).index)