How to preserve the original data frame after creating a 'numerized' version of it via cat

I have a data frame called df in which some columns have object dtype. I want to convert those columns to categorical codes using the following for loop:

df_numerized = df
for col in df_numerized.columns:
    if df_numerized[col].dtype == 'object':
        df_numerized[col] = df_numerized[col].astype('category')
        df_numerized[col] = df_numerized[col].cat.codes

Now when I call df_numerized I get what I want, but this has also changed the original data frame df in a similar way. How can I run my code without 'numerizing' the original data frame?

CodePudding user response：

Start by using the copy method, so that the loop works on a separate data frame:

df_numerized = df.copy()

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
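
As a quick check, here is a minimal sketch of the loop from the question run on a copy. The sample data and column names are made up for illustration, and the object dtype is set explicitly so the example does not rely on how a particular pandas version infers string columns.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
# dtype=object is forced so the dtype check below behaves the same
# regardless of how the installed pandas version infers string columns.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Bergen", "Oslo"], dtype=object),
    "sales": [10, 20, 30],
})

# Work on an independent copy instead of a second name for the same object.
df_numerized = df.copy()
for col in df_numerized.columns:
    if df_numerized[col].dtype == "object":
        # Categories are sorted, so "Bergen" gets code 0 and "Oslo" gets code 1.
        df_numerized[col] = df_numerized[col].astype("category").cat.codes

print(df["city"].tolist())            # original strings are still there
print(df_numerized["city"].tolist())  # integer codes
```

With plain `df_numerized = df` in place of `df.copy()`, both print statements would show the integer codes, which is the behaviour described in the question.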

CodePudding user response：

df_numerized = df does not create a new data frame; it only binds a second name to the same object, so every update made through df_numerized also shows up in df. To prevent this, create a copy first, e.g. df_numerized = df.copy(). However, there is a more concise approach using factorize, which builds a new data frame and leaves df untouched:

cols = df.select_dtypes('object')
df_numerized = df.assign(**cols.apply(lambda s: s.factorize(sort=True)[0]))
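
To see that the factorize version gives the same codes as cat.codes without touching df, here is a small sketch on made-up data. The column names and values are hypothetical, the frame uses the default integer index, and the object dtype is set explicitly so that select_dtypes('object') picks the columns up.

```python
import pandas as pd

# Hypothetical sample data with a default integer index.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Bergen", "Oslo", "Tromso"], dtype=object),
    "size": pd.Series(["S", "L", "S", "M"], dtype=object),
    "sales": [10, 20, 30, 40],
})

# Same two lines as in the answer: factorize every object column and
# build a new data frame with assign, which does not modify df.
cols = df.select_dtypes("object")
df_numerized = df.assign(**cols.apply(lambda s: s.factorize(sort=True)[0]))

# Reference result computed with the cat.codes route from the question.
expected_city = df["city"].astype("category").cat.codes.tolist()

print(df_numerized["city"].tolist())
print(expected_city)
print(df["city"].tolist())  # original column is unchanged
```

Note that factorize returns codes in the platform integer type while cat.codes uses a small integer type such as int8, so the values match but the dtypes of the resulting columns can differ.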