I have a data frame called df, which has certain columns with object values. I want to turn those into categorical columns by using the following for loop:
df_numerized = df
for col in df_numerized.columns:
    if df_numerized[col].dtype == 'object':
        df_numerized[col] = df_numerized[col].astype('category')
        df_numerized[col] = df_numerized[col].cat.codes
Now when I call df_numerized I get what I want, but this has also changed the original data frame df in the same way. How can I run my code without 'numerizing' the original data frame?
CodePudding user response:
Start by using the copy method:
df_numerized = df.copy()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
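To illustrate why the copy matters, here is a minimal sketch. The sample data and the names alias and independent are made up for this example and are not from the question. It shows that plain assignment only adds a second name for the same DataFrame, while copy() returns a separate object that can be modified safely.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original question.
df = pd.DataFrame({"city": ["Oslo", "Bergen", "Oslo"], "n": [1, 2, 3]})

alias = df                # plain assignment: a second name for the same object
independent = df.copy()   # copy(): a new DataFrame with its own data

alias_is_same_object = alias is df        # True
copy_is_same_object = independent is df   # False

# Convert the column on the copy only; df keeps its original strings.
independent["city"] = independent["city"].astype("category").cat.codes

original_values = df["city"].tolist()
copied_values = independent["city"].tolist()
```

After this runs, original_values still holds the city names, while copied_values holds the integer codes.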
CodePudding user response:
df_numerized is simply another name for the same object as df, so any change you make through df_numerized also shows up in df. To prevent this, create a copy of df, e.g. df_numerized = df.copy(). However, there is a more concise approach using factorize, which leaves df untouched because assign returns a new data frame:
cols = df.select_dtypes('object')
df_numerized = df.assign(**cols.apply(lambda s: s.factorize(sort=True)[0]))
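As a rough check that the factorize version matches the loop from the question, the sketch below runs both on a small made-up frame. The sample data and the names via_factorize and via_loop are assumptions for illustration only. The column is created with dtype=object explicitly so that the object checks behave the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample data; dtype=object is set explicitly so the column is
# an object column regardless of the pandas version's default string dtype.
df = pd.DataFrame({
    "city": pd.Series(["Oslo", "Bergen", "Oslo", "Tromso"], dtype=object),
    "n": [1, 2, 3, 4],
})

cols = df.select_dtypes("object")

# factorize(sort=True) returns (codes, uniques); [0] keeps only the codes.
via_factorize = df.assign(**cols.apply(lambda s: s.factorize(sort=True)[0]))

# The loop from the question, run on a copy so df is not modified.
via_loop = df.copy()
for col in via_loop.columns:
    if via_loop[col].dtype == "object":
        via_loop[col] = via_loop[col].astype("category").cat.codes

codes_match = via_factorize["city"].tolist() == via_loop["city"].tolist()
original_values = df["city"].tolist()
```

With sort=True the codes follow the sorted unique values, which is the same order astype('category') uses for its categories, so both approaches should produce identical codes here. Missing values are coded as -1 in both cases.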