Data type not changing

I have the Chronic Kidney Disease dataset from Kaggle. The dataset has dtypes of `int64`, `float64` and `object`. To train an efficient machine learning model, I wanted to convert all the values to integer/float type. I used astype(int) and astype(float) for that, but two columns are still not changing.

import pandas as pd 
train = pd.read_csv("/content/kidney_disease_train.csv")
test = pd.read_csv("/content/kidney_disease_test.csv")

I converted most of the values to integers. There were also some corrupt values like \t? in the columns wc and rc, which I replaced as well:

train["rbc"] = (train["rbc"] == "normal").astype(int)
train["pc"] = (train["pc"] == "normal").astype(int)
train["pcc"] = (train["pcc"] == "present").astype(int)
train["ba"] = (train["ba"] == "present").astype(int)
train["htn"] = (train["htn"] == "yes").astype(int)
train["dm"] = (train["dm"] == "yes").astype(int)
train["cad"] = (train["cad"] == "yes").astype(int)
train["appet"] = (train["appet"] == "good").astype(int)
train["pe"] = (train["pe"] == "yes").astype(int)
train["ane"] = (train["ane"] == "yes").astype(int)
train["classification"] = (train["classification"] == "ckd").astype(int)
train.replace(to_replace = "\tno" , value = "no" , inplace = True)
train.replace(to_replace = "\tyes" , value = "yes" , inplace = True)
train.replace(to_replace = "\t8400" , value = 8400 , inplace = True)
train.replace(to_replace = "\t?" , value = 4500 , inplace = True)

I wanted to calculate the mean and use it to fill the null values, but the columns wc and rc raise the error TypeError: can only concatenate str (not "int") to str. I tried converting these values to float instead of integer, and that block of code runs without any error, but these particular columns still show the same error. The other columns get their data type changed; these two remain object.

The conversion of the columns appeared to succeed, but df.info() still shows them as object dtype.

Any lead toward a solution would be appreciated.

CodePudding user response:

You can use:

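import numpy as np

# Strip the stray tab and replace '?' with 4500, then cast to float. Why 4500?
train['wc'] = train['wc'].replace({r'\t': '', r'\?': '4500'}, regex=True).astype(float)

# Same cleanup, but treat '?' as missing. np.nan?
train['rc'] = train['rc'].replace({r'\t': '', r'\?': np.nan}, regex=True).astype(float)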

Output:

>>> train.dtypes
id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                 int64
pc                  int64
pcc                 int64
ba                  int64
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv               float64
wc                float64
rc                float64
htn                 int64
dm                  int64
cad                 int64
appet               int64
pe                  int64
ane                 int64
classification      int64
dtype: object
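As an alternative to the regex replacement, pd.to_numeric with errors="coerce" might be worth a try, since it turns anything unparseable into NaN in one step and the mean fill can follow directly. This is only a minimal sketch: the small raw series below is a made-up sample that imitates the strings in wc, not the real column.

```python
import pandas as pd

# Hypothetical sample imitating the raw strings in the wc column
raw = pd.Series(["7900", None, "\t8400", "\t?", "4500"], dtype=object)

# Strip surrounding whitespace, then coerce non-numeric entries such as "?" to NaN
cleaned = pd.to_numeric(raw.str.strip(), errors="coerce")

# Fill the missing entries with the mean of the remaining values
filled = cleaned.fillna(cleaned.mean())

print(cleaned.dtype)
print(filled)
```

On the real data the same two lines would be applied to train['wc'] and train['rc'], assuming "?" should be treated as missing rather than replaced by a fixed number.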

Data exploration

>>> train['wc'].unique()
array(['7900', nan, '7200', '8300', '4200', '9900', '10500', '2200',
       '7500', '8400', '15700', '7000', '4700', '9600', '6700', '9000',
       '5900', '4300', '12700', '5500', '5000', '9700', '6900', '9800',
       '5800', '6400', '8100', '15200', '5600', '14900', '9100', '26400',
       '8000', '6500', '9200', '6800', '10800', '4500', '\t?', '10700',
       '11000', '9400', '6300', '10300', '9500', '6200', '6600', '4100',
       '7700', '5400', '13600', '\t8400', '11500', '10900', '12200',
       '8600', '7300', '5200', '7400', '12800', '6000', '9300', '7800',
       '10400', '8800', '10200', '16700', '8500', '21600', '12500',
       '13200', '5100', '12300', '18900', '5700', '8200', '16300', '4900',
       '14600'], dtype=object)

>>> train['rc'].unique()
array(['3.9', nan, '5.5', '4.6', '3.4', '4.7', '6.1', '2.6', '5.6', '3.3',
       '3.8', '5.0', '4.5', '5.7', '3.5', '6.0', '5.2', '4.2', '3.7',
       '5.9', '4.9', '4.8', '3.2', '3.0', '3.6', '4.0', '2.5', '4.1',
       '6.2', '5.1', '6.5', '5.8', '4.4', '5.4', '3', '4.3', '4', '2.1',
       '8.0', '5.3', '3.1', '2.3', '\t?', '2.9', '6.3', '6.4', '2.4',
       '2.7', '5'], dtype=object)
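These values probably explain why the original attempt left both columns as object. Every valid entry is still a string, so swapping "\t?" or "\t8400" for integers only produces a column that mixes str and int, and summing such a column for the mean is what triggers the concatenation error. A small sketch on a made-up sample (not the real column):

```python
import pandas as pd

# Hypothetical sample imitating the raw strings in the wc column
raw = pd.Series(["7900", "\t8400", "\t?", "4500"], dtype=object)

# Same kind of exact-match replacement as in the question: strings become ints,
# but the other entries stay strings
mixed = raw.replace("\t8400", 8400).replace("\t?", 4500)

# Count how many entries are still strings after the replacement
n_strings = sum(isinstance(v, str) for v in mixed)

print(mixed.dtype)
print(n_strings, "of", len(mixed), "values are still strings")
```

Converting the whole column to a numeric dtype, rather than replacing individual values, is what lets the mean be computed.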