I am trying to do a pretty simple task, but I can't understand pandas' behavior. I am reading a UCI dataset in Python using pandas:
data = pd.read_csv('UCI/breast-cancer-wisconsin.data', header=None)
Printing out the first row:
data.values[0]
# array([1000025, 5, 1, 1, 1, 2, '1', 3, 1, 1, 2], dtype=object)
Why does it read the 7th column as string? I tried the following:
print(pd.api.types.infer_dtype(data[6]))  # returns 'string'
It is a pretty simple dataset, downloaded directly from this link, and all values appear to be integers to me. So why is column 6 (the 7th column) interpreted as a string?
CodePudding user response:
Check the unique values: the column contains the character ?, which stands for a missing value, so the whole column is parsed as strings:
data = pd.read_csv('breast-cancer-wisconsin.data', header=None)[6]
print(data.unique())
# ['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
The solution is to add the parameter na_values='?' so that ? is converted to a missing value:
data = pd.read_csv('breast-cancer-wisconsin.data', header=None, na_values='?')
print(data.dtypes)
0       int64
1       int64
2       int64
3       int64
4       int64
5       int64
6     float64
7       int64
8       int64
9       int64
10      int64
dtype: object
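Column 6 comes out as float64 rather than int64 because the missing entries are stored as NaN, which is a float. As a rough sketch of the same fix on an in-memory sample (the three rows below are a hypothetical stand-in for the real file, and the final conversion to the nullable "Int64" dtype is an optional extra, not part of the answer above):

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as the dataset;
# the second row has '?' in column 6.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1017122,8,10,10,8,7,10,9,7,1,4\n"
)

data = pd.read_csv(sample, header=None, na_values="?")

# '?' was read as NaN, so the column is float64 instead of object.
col_dtype = data[6].dtype
missing_count = int(data[6].isna().sum())

# Optional: the nullable integer dtype keeps integers alongside missing values.
as_nullable_int = data[6].astype("Int64")

print(col_dtype, missing_count, as_nullable_int.dtype)
```

If integer values are needed without missing entries, dropping or imputing the affected rows before calling astype("int64") is another option.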
CodePudding user response:
I loaded the dataset. Note that this specific column contains the value "?". This results in dtype = object, since the column cannot be cast to an integer type automatically.
data.iloc[:, 6].unique()
Results in:
array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'],
dtype=object)
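When a column has too many distinct values to scan by eye, one possible way to list only the entries that fail to parse is pd.to_numeric with errors="coerce". This is a minimal sketch; the short Series below is a hypothetical stand-in for column 6 as read without na_values:

```python
import pandas as pd

# Hypothetical stand-in for column 6 as read without na_values.
col = pd.Series(["1", "10", "?", "4"], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(col, errors="coerce")

# Entries that were present but could not be parsed are the offenders.
bad_values = col[numeric.isna() & col.notna()].unique().tolist()

print(bad_values)
```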
CodePudding user response:
This is probably due to a string value somewhere in the column. To quickly test that, try this (it should raise an error):
data.iloc[:, 6].astype("int")
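A self-contained sketch of that check, wrapped in try/except so the failure is reported rather than stopping the script (the Series is again a hypothetical stand-in for the real column):

```python
import pandas as pd

# Hypothetical stand-in for column 6 as read without na_values.
col = pd.Series(["1", "10", "?", "4"], dtype=object)

try:
    col.astype("int")
    conversion_failed = False
except ValueError as exc:
    # '?' cannot be parsed as an integer, so pandas raises ValueError.
    conversion_failed = True
    print("conversion failed:", exc)
```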