I have the Chronic Kidney Disease dataset from Kaggle. Its columns have the dtypes `int64`, `float64` and `object`. To train an efficient machine learning model I wanted to convert all the values to integer/float type. I used astype(int) and astype(float) for that, but two columns are still not changing.
import pandas as pd
train = pd.read_csv("/content/kidney_disease_train.csv")
test = pd.read_csv("/content/kidney_disease_test.csv")
I converted most of the values to integers with the code below. There were also some corrupt values like \t? in the columns wc and rc, which I replaced as well.
train["rbc"] = (train["rbc"] == "normal").astype(int)
train["pc"] = (train["pc"] == "normal").astype(int)
train["pcc"] = (train["pcc"] == "present").astype(int)
train["ba"] = (train["ba"] == "present").astype(int)
train["htn"] = (train["htn"] == "yes").astype(int)
train["dm"] = (train["dm"] == "yes").astype(int)
train["cad"] = (train["cad"] == "yes").astype(int)
train["appet"] = (train["appet"] == "good").astype(int)
train["pe"] = (train["pe"] == "yes").astype(int)
train["ane"] = (train["ane"] == "yes").astype(int)
train["classification"] = (train["classification"] == "ckd").astype(int)
train.replace(to_replace = "\tno" , value = "no" , inplace = True)
train.replace(to_replace = "\tyes" , value = "yes" , inplace = True)
train.replace(to_replace = "\t8400" , value = 8400 , inplace = True)
train.replace(to_replace = "\t?" , value = 4500 , inplace = True)
I wanted to calculate the mean of each column and use it to fill the null values, but the columns wc and rc raise the error TypeError: can only concatenate str (not "int") to str. I tried converting these values to float instead of integer, and that block of code runs without any error, but these particular columns still show the same error. The other columns get their data type changed, while these two remain object.
I tried changing the data types of these two columns and the conversion appeared to succeed, but df.info() still shows them as object.
Any lead towards a solution would be appreciated.
CodePudding user response:
You can use:
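import numpy as np

# Why 4500? It is kept here only because the question replaces '?' with 4500.
train['wc'] = train['wc'].replace({r'\t': '', r'\?': '4500'}, regex=True).astype(float)
# Perhaps np.nan instead, so that '?' is treated as a missing value?
train['rc'] = train['rc'].replace({r'\t': '', r'\?': np.nan}, regex=True).astype(float)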
Output:
>>> train.dtypes
id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                 int64
pc                  int64
pcc                 int64
ba                  int64
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv               float64
wc                float64
rc                float64
htn                 int64
dm                  int64
cad                 int64
appet               int64
pe                  int64
ane                 int64
classification      int64
dtype: object
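Since the goal in the question is to fill the null values with the column mean, an alternative to the regex replacement is pd.to_numeric with errors="coerce", which turns anything unparseable into NaN. The following is only a minimal sketch: the `sample` frame is a made-up stand-in for the real train data, and treating "\t?" as missing (rather than as 4500) is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw 'wc' and 'rc' columns from the question.
sample = pd.DataFrame({
    "wc": ["7900", np.nan, "\t8400", "\t?", "4500"],
    "rc": ["3.9", np.nan, "5.5", "\t?", "5"],
})

for col in ["wc", "rc"]:
    # Cast to str and strip whitespace so tab-prefixed numbers like "\t8400" survive.
    cleaned = sample[col].astype(str).str.strip()
    # Anything that is still not a number (for example "?") becomes NaN.
    sample[col] = pd.to_numeric(cleaned, errors="coerce")
    # Fill the gaps with the column mean, which pandas computes over non-null values.
    sample[col] = sample[col].fillna(sample[col].mean())

print(sample.dtypes)
print(sample)
```

With the real data, the same loop would run over train[["wc", "rc"]]. Whether the mean is an appropriate fill value depends on the model and is left open here.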
Data exploration
>>> train['wc'].unique()
array(['7900', nan, '7200', '8300', '4200', '9900', '10500', '2200',
'7500', '8400', '15700', '7000', '4700', '9600', '6700', '9000',
'5900', '4300', '12700', '5500', '5000', '9700', '6900', '9800',
'5800', '6400', '8100', '15200', '5600', '14900', '9100', '26400',
'8000', '6500', '9200', '6800', '10800', '4500', '\t?', '10700',
'11000', '9400', '6300', '10300', '9500', '6200', '6600', '4100',
'7700', '5400', '13600', '\t8400', '11500', '10900', '12200',
'8600', '7300', '5200', '7400', '12800', '6000', '9300', '7800',
'10400', '8800', '10200', '16700', '8500', '21600', '12500',
'13200', '5100', '12300', '18900', '5700', '8200', '16300', '4900',
'14600'], dtype=object)
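Rather than scanning the array above by eye, the problematic entries can be picked out programmatically. This is a rough sketch on a hypothetical five-value series standing in for train['wc']; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train['wc'] with the two kinds of bad values seen above.
wc = pd.Series(["7900", np.nan, "\t?", "10700", "\t8400"])

as_text = wc.astype(str)
as_number = pd.to_numeric(as_text.str.strip(), errors="coerce")

# Present in the raw data but not numeric even after stripping whitespace.
unparseable = wc[wc.notna() & as_number.isna()]

# Present in the raw data and carrying stray leading or trailing whitespace.
has_whitespace = wc[wc.notna() & (as_text != as_text.str.strip())]

print(unparseable.tolist())
print(has_whitespace.tolist())
```

Here `unparseable` should hold only the "\t?" entry, while `has_whitespace` should also catch "\t8400", which is a valid number once the tab is removed.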
>>> train['rc'].unique()
array(['3.9', nan, '5.5', '4.6', '3.4', '4.7', '6.1', '2.6', '5.6', '3.3',
'3.8', '5.0', '4.5', '5.7', '3.5', '6.0', '5.2', '4.2', '3.7',
'5.9', '4.9', '4.8', '3.2', '3.0', '3.6', '4.0', '2.5', '4.1',
'6.2', '5.1', '6.5', '5.8', '4.4', '5.4', '3', '4.3', '4', '2.1',
'8.0', '5.3', '3.1', '2.3', '\t?', '2.9', '6.3', '6.4', '2.4',
'2.7', '5'], dtype=object)
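A plausible reading of the TypeError in the question, offered as an assumption rather than a confirmed diagnosis: replacing "\t?" with the integer 4500 leaves the column with a mix of strings and integers, so it stays object, and any attempt to add the values together fails. The sketch below reproduces that situation on a made-up three-value series.

```python
import pandas as pd

# Hypothetical stand-in for the raw 'rc' column, kept as object dtype.
rc = pd.Series(["3.9", "\t?", "5"], dtype=object)

# Same kind of replacement as in the question: a string is swapped for an int.
mixed = rc.replace("\t?", 4500)

print(mixed.dtype)                          # dtype after the replacement
print([type(v).__name__ for v in mixed])    # element types in the column

# Adding a string and an integer is what a sum or mean over such a column runs into.
try:
    "3.9" + 4500
except TypeError as exc:
    message = str(exc)
    print(message)
```

Converting the whole column to a numeric dtype before computing the mean avoids the mixed types.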