Python Pandas replace new values in column with 'other'-CodePudding

I have a pandas data frame with a factor column with 30 different levels. Some of the levels occur infrequently so I convert them to an 'Other' grouping. The resulting column has 25 different levels plus 1 'other' level.

d = df1['column1'].value_counts() >= 50
df1['column1'] = [i if d[i] else 'Other' for i in df1['column1']]
df1['column1'] = df1['column1'].astype('category')

I have a second data frame that I'd like to convert to have the same levels as the first data frame (including any new levels that don't occur in the first data frame). I've tried the code below but I get a 'Key Error' but it doesn't really explain the issue.

df2['column1'] = [i if d[i] else 'Other' for i in df2['column1']]
df2['column1'] = df2['column1'].astype('category')

Any idea what's causing this?

CodePudding user response：

I was able to reproduce your Key Error with your code by injecting values into df2['column1'] that don't exist in df1['column1'].

You could make that process resilient by doing:

df1 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(10, size=100)]})

df2 containing additional values:

df2 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(12, size=100)]})

Get the most frequent levels and translate:

cat_counts = df1['column1'].value_counts()

df1.assign(column1=np.where(df1['column1'].isin(cat_counts[cat_counts > 10].index), df1['column1'], 'other')).astype({'column1': 'category'})

   column1
0       L4
1       L9
2       L9
3    other
4    other
..     ...
95   other
96   other
97   other
98      L3
99   other

That same construct also works for df2 even though it contains values not present in df1:

df2.assign(column1=np.where(df2['column1'].isin(cat_counts[cat_counts > 10].index), df2['column1'], 'other')).astype({'column1': 'category'})

   column1
0    other
1       L9
2    other
3    other
4    other
..     ...
95   other
96   other
97   other
98      L9
99   other

Another option would be to select the n most frequent levels:

df1['column1'].isin(cat_counts.nlargest(5).index)