Reading a dataset in pandas

I am trying to do a pretty simple task, but I am unable to understand pandas' behavior. I am reading a UCI dataset in Python using pandas:

data = pd.read_csv('UCI/breast-cancer-wisconsin.data', header=None)

Printing out the first row:

data.values[0]
# array([1000025, 5, 1, 1, 1, 2, '1', 3, 1, 1, 2], dtype=object)

Why does it read the 7th column as a string? I tried the following:

print(pd.api.types.infer_dtype(data[6])) #returns string

It is a pretty simple dataset, downloaded directly from this link, and all values appear to be integers to me. So why is the 7th column (index 6) interpreted as a string?

CodePudding user response：

Check the unique values: the file uses the character ? to mark a missing value, so the whole column is parsed as strings:

data = pd.read_csv('breast-cancer-wisconsin.data', header=None)[6]
print (data.unique())
['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']

The solution is to add the parameter na_values='?' so that ? is converted to a missing value (NaN). The column then becomes float64, because NaN is a floating-point value:

data = pd.read_csv('breast-cancer-wisconsin.data', header=None, na_values='?')
print (data.dtypes)
0       int64
1       int64
2       int64
3       int64
4       int64
5       int64
6     float64
7       int64
8       int64
9       int64
10      int64
dtype: object
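
If integers are preferred over float64, one option is pandas' nullable integer dtype. The following is only a sketch: it uses a small hypothetical three-row sample (in the same layout as the UCI file) instead of the real download, and the names sample, raw, clean and col6_int are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as the UCI file;
# the second row has "?" in column 6.
sample = (
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1017023,4,1,1,3,2,1,3,1,1,2\n"
)

raw = pd.read_csv(io.StringIO(sample), header=None)
clean = pd.read_csv(io.StringIO(sample), header=None, na_values="?")

# Without na_values, "?" forces column 6 to a non-numeric dtype
# (object, or a string dtype depending on the pandas version).
print(raw[6].dtype)

# With na_values, "?" becomes NaN and the column is float64.
print(clean[6].dtype)

# The nullable "Int64" dtype stores whole numbers alongside missing values.
col6_int = clean[6].astype("Int64")
print(col6_int.tolist())
```

Note the capital I in "Int64": the plain "int64" dtype cannot hold missing values, so that conversion would fail while NaN is still present.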

CodePudding user response：

I loaded the dataset. Note that this specific column contains the value "?". This results in dtype=object, since the column cannot be cast to an integer automatically.

data.iloc[:, 6].unique()

Results in:

array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'],
      dtype=object)
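
When the unique values are too many to scan by eye, the offending entries can also be located programmatically. This is a rough sketch using pd.to_numeric with errors="coerce"; the Series col is a hypothetical stand-in for column 6 as read without na_values.

```python
import pandas as pd

# Hypothetical stand-in for column 6 as read without na_values.
col = pd.Series(["1", "10", "?", "4"], dtype=object)

# Entries that cannot be parsed as numbers become NaN instead of raising.
as_num = pd.to_numeric(col, errors="coerce")

# Keep the original strings at the positions that failed to parse.
bad = col[as_num.isna()]
print(bad)
```

The index of bad gives the row positions of the non-numeric values, which helps when deciding whether to drop or impute those rows.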

CodePudding user response：

This is probably due to some string value in the column. To quickly test that, try this (it should raise an error):

df.iloc[:,6].astype("int")
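
A self-contained version of that check might look like the following sketch. The Series col is a hypothetical stand-in for df.iloc[:,6], and the try/except is only there so the script keeps running after the expected failure.

```python
import pandas as pd

# Hypothetical stand-in for df.iloc[:, 6]; one entry is not a number.
col = pd.Series(["1", "10", "?", "4"], dtype=object)

try:
    col.astype("int")
    failed = False
except ValueError as exc:
    # The "?" entry cannot be parsed as an integer, so the cast fails.
    failed = True
    print(f"Conversion failed: {exc}")
```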