I'm trying to analyze a dataset with Python. The dataset has several column types (int, float, string). I converted all of them except two attributes, source port and destination port, whose dtype is object.
When I inspect these attributes in Python:
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   sport   668522 non-null  object
 1   dport   668522 non-null  object
The values are:
sport dport
0 6226 80
1 6227 80
2 6228 80
3 6229 80
4 6230 80
As far as I can see, these are just numeric values, so why does Python treat the ports as object?
I also tried the Weka tool, but it can't read the values either. Can anyone explain the reason, or how to solve the problem?
The port is an important feature and is useful for mining the data, so I don't want to drop it from the dataset.
Update: the dataset is in CSV format, and the sample of values is shown above. There are 2 features: source port ("sport" for short) and destination port ("dport" for short).
In Python, I read the values like this:
import pandas as pd
dt = pd.read_csv("port.csv")
Printing dt shows the values, but an ML algorithm such as k-means cannot handle these columns.
In Weka, on the other hand, importing the CSV file displays the following message: "Attribute is neither numeric nor nominal"
CodePudding user response:
You can convert a column to another data type with astype, so you don't need to drop the column; change its dtype instead.
import pandas as pd

# create DataFrame
df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E'],
                   'points': ['25', '27', '0x0303', '17', '20'],
                   'assists': ['5', '7', '10', '8', '9']})
print(df.dtypes)

# convert 'points' column to integer
# df['points'] = df['points'].astype(int)
# the line above raises ValueError because 'points' contains a non-numeric value ('0x0303')
df['points'] = pd.to_numeric(df['points'], errors='coerce').astype('Int64')
# errors='coerce' replaces every value that cannot be parsed with NaN (<NA> after the Int64 cast)

# check dtype after conversion
print("after data type conversion \n", df.dtypes)
# output will look like this
player     object
points     object
assists    object
dtype: object
after data type conversion
player     object
points      Int64
assists    object
dtype: object
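Note that errors='coerce' turns '0x0303' into a missing value. If entries like that in your port columns are really hexadecimal port numbers (that is an assumption, I can't see your file), you may prefer to parse them rather than lose them. A minimal sketch, where parse_port is a hypothetical helper of my own and not a pandas function:

```python
import pandas as pd

def parse_port(value):
    """Parse a decimal or 0x-prefixed hexadecimal string; return pd.NA if neither works."""
    text = str(value).strip().lower()
    try:
        # int() accepts the '0x' prefix when the base is given explicitly as 16
        return int(text, 16) if text.startswith('0x') else int(text)
    except ValueError:
        return pd.NA

df = pd.DataFrame({'points': ['25', '27', '0x0303', '17', '20']})
# Build a nullable integer column so that unparseable entries can stay as <NA>
df['points'] = pd.array([parse_port(v) for v in df['points']], dtype='Int64')
print(df['points'].tolist())
print(df.dtypes)
```

Here '0x0303' is kept as 771 instead of becoming missing. Anything that is neither decimal nor 0x-prefixed hex (for example 'abc' or '80.0') still ends up as <NA>, so check how many of those you get before feeding the column to k-means.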
This answer might help you understand why pandas uses object as a dtype.
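For your own file, one non-numeric entry anywhere in a column is enough for read_csv to load the whole column as text instead of numbers, and that is what shows up as object. A rough way to find the offending values, using a small made-up CSV in place of port.csv (its contents, including the '0x0303' entry, are my assumption and not your data):

```python
import io
import pandas as pd

# Hypothetical stand-in for port.csv: one hexadecimal entry among decimal ports
csv_text = "sport,dport\n6226,80\n6227,80\n0x0303,80\n6229,80\n"
dt = pd.read_csv(io.StringIO(csv_text))

# sport is loaded as text because of the single odd entry; dport is loaded as integers
print(dt.dtypes)

# Values that are present but cannot be parsed as numbers
as_number = pd.to_numeric(dt['sport'], errors='coerce')
bad_rows = dt.loc[as_number.isna() & dt['sport'].notna(), 'sport']
print(bad_rows.value_counts())
```

Running the same two steps on dt = pd.read_csv("port.csv") for both sport and dport should show you what is actually in those columns, and then you can decide whether to convert, coerce or drop those rows.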