I want to use categorical features directly with a CatBoost model, so I need to declare my object columns as categorical. I have a column in my data frame of dtype object that contains NACE codes and looks like this:
NACE_code
5632 81.101
8060 41.200
15147 43.120
24644 68.100
29144 86.909
37122 68
39853 43
59268 43
108633 70.220
108693 56.102
175820 43.320
184606 41.200
Name: NACE_code, dtype: object
Python doesn't accept this column as a categorical column. Instead, it tells me the column is a float because some of the values contain dots. I am relatively new to Python and have tried different ways to remove the dots from those values, but my latest attempt turns every value without a dot into NaN:
df['NACE_code'].str.replace(r"(\d)\.", r"\1")
5632 81101
8060 41200
15147 43120
24644 68100
29144 86909
37122 NaN
39853 NaN
59268 NaN
108633 70220
108693 56102
175820 43320
184606 41200
Name: NACE_code, dtype: object
How do I get my column to look like this? I appreciate any help I can get!
5632 81101
8060 41200
15147 43120
24644 68100
29144 86909
37122 68
39853 43
59268 43
108633 70220
108693 56102
175820 43320
184606 41200
CodePudding user response:
Use astype('str') to convert the column to string type before calling str.replace.
Without regex:
df['NACE_code'].astype('str').str.replace(r".", r"", regex=False)
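The NaN values in the question are consistent with the column holding a mix of Python strings and numbers, because the .str accessor returns NaN for any element that is not a string. Here is a minimal sketch of that behaviour, assuming a small made-up Series in which the dotted codes are strings and the short codes are integers (the real column may be stored differently):

```python
import pandas as pd

# Hypothetical sample: dotted codes stored as str, short codes stored as int.
codes = pd.Series(["81.101", "41.200", 68, 43], dtype=object, name="NACE_code")

# .str methods skip non-string elements and return NaN for them.
direct = codes.str.replace(".", "", regex=False)

# Casting to str first makes every element eligible for the replacement.
cleaned = codes.astype(str).str.replace(".", "", regex=False)

print(direct.tolist())
print(cleaned.tolist())
```

If the short codes are stored as floats rather than integers, astype(str) turns 68.0 into '68.0' and the replacement then yields '680', so it is worth checking the stored types first.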
CodePudding user response:
# The following code should work:
df.NACE_code = df.NACE_code.astype(str)
# regex=False treats '.' as a literal dot, not as the regex wildcard
df.NACE_code = df.NACE_code.str.replace('.', '', regex=False)
CodePudding user response:
Thanks for the response Aakash Dusane and gajendragarg!
When I run either of these, new digits appear at the end of the values without dots. The output is:
5632 81101
8060 41200
15147 43120
24644 68100
29144 86909
37122 6811
39853 4311
59268 4311
108633 70220
108693 56102
175820 43320
184606 41200
Do you know why?
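One way to investigate is to inspect the raw values and their Python types before any conversion, since the printed Series does not show whether an entry is a string or a number. The sketch below uses a hypothetical sample in place of the real column. It only shows how to look, and does not establish the cause in this case:

```python
import pandas as pd

# Hypothetical sample mixing str, float and int entries in one object column.
codes = pd.Series(["81.101", 68.0, 43, "70.220"], dtype=object, name="NACE_code")

# Count how many elements of each Python type the column holds.
type_counts = codes.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# repr() shows exactly what is stored, including quotes around strings.
raw_values = [repr(v) for v in codes]
print(raw_values)
```

Running the same two checks on df['NACE_code'] for the affected rows should show what str() is actually being applied to.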