How to read an object dtype attribute as int in a Pandas column, or "ValueError: invalid literal

Time:09-24

I'm trying to analyze a dataset with Python. The dataset has several types (int, float, string). I converted all of them except two attributes (source port and destination port), whose dtype is object.

When I inspect these attributes in Python:

     Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   sport   668522 non-null  object
 1   dport   668522 non-null  object

The values are:

  sport dport
0  6226    80
1  6227    80
2  6228    80
3  6229    80
4  6230    80

As far as I can see, the values are just numbers, so why does Python treat the ports as object?
I also tried the Weka tool, but it can't read the values either. Can anyone explain the reason, or how to solve the problem? The port is an important feature that is useful for mining the data, so I don't want to drop it from the dataset.

Update: The dataset is in CSV format, and a sample of the values is shown above. There are two features: source port ("sport" for short) and destination port ("dport" for short).

In Python, to read the values:

import pandas as pd 
dt = pd.read_csv("port.csv")

Printing dt shows the values, but an ML algorithm such as k-means can't handle them.

In Weka, on the other hand, importing the CSV file displayed the following message: "Attribute is neither numeric nor nominal"

CodePudding user response:

You can convert a column to another data type with astype, so you don't need to drop the column; change its dtype instead.

import pandas as pd

#create DataFrame
df = pd.DataFrame({'player': ['A', 'B', 'C', 'D', 'E'],
                   'points': ['25', '27', '0x0303', '17', '20'],
                   'assists': ['5', '7', '10', '8', '9']})
print(df.dtypes)
# Convert the 'points' column to integer
# df['points'] = df['points'].astype(int)
# The line above raises ValueError because 'points' contains a non-numeric value ('0x0303')

df['points'] = pd.to_numeric(df['points'], errors='coerce').astype('Int64')
# errors='coerce' replaces values that cannot be parsed as numbers with NaN;
# the nullable 'Int64' dtype then stores them as <NA>

# Check the dtypes after conversion
print("after data type conversion \n", df.dtypes)


# The output will look like this
player     object
points     object
assists    object
dtype: object
after data type conversion 
player     object
points      Int64
assists    object
dtype: object
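
Applied to the sport and dport columns from the question, the same idea might look like the sketch below. The sample data is hypothetical: it assumes the offending entries are hex strings such as '0x0303', which the "invalid literal" error hints at but the question does not confirm. parse_port is a helper name made up for this illustration. The loop first prints which values a plain decimal conversion would reject, so you can see what is actually in your file before deciding how to handle it.

```python
import pandas as pd

# Hypothetical sample: the hex-like strings are an assumption, not values from the real file
df = pd.DataFrame({"sport": ["6226", "6227", "0x0303", "6229"],
                   "dport": ["80", "80", "80", "0x0050"]})

def parse_port(value):
    """Return the port as an int, or pd.NA if the text cannot be parsed."""
    text = str(value).strip()
    try:
        # Base 16 for an explicit 0x prefix, base 10 otherwise
        return int(text, 16) if text.lower().startswith("0x") else int(text)
    except ValueError:
        return pd.NA

for col in ["sport", "dport"]:
    # Rows that a plain decimal conversion would reject
    bad = pd.to_numeric(df[col], errors="coerce").isna()
    print(col, "non-decimal values:", df.loc[bad, col].unique().tolist())
    # Nullable Int64 keeps unparseable entries as <NA> instead of failing
    df[col] = pd.Series([parse_port(v) for v in df[col]],
                        index=df.index, dtype="Int64")

print(df.dtypes)
```

If the non-decimal values turn out to be something other than hex, errors='coerce' as shown above is the safer default; the rows that end up as <NA> then need to be dropped or filled before running k-means.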

This answer might help: why does pandas use object as dtype?
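
As a rough illustration of why read_csv ends up with a non-numeric column, the sketch below reads two tiny in-memory CSVs (made-up data, not the real port.csv). A single value that cannot be parsed as a number is enough to make pandas keep the whole column as text, even when every other row looks like an integer.

```python
import io
import pandas as pd

# Made-up CSV contents; the real port.csv is not available here
clean_csv = "sport,dport\n6226,80\n6227,80\n"
mixed_csv = "sport,dport\n6226,80\n0x0303,80\n"  # one non-decimal value in sport

clean = pd.read_csv(io.StringIO(clean_csv))
mixed = pd.read_csv(io.StringIO(mixed_csv))

# Every value parses as an integer, so both columns are numeric
print(clean.dtypes)

# sport is kept as text because of the single '0x0303' entry;
# dport is unaffected and stays numeric
print(mixed.dtypes)
```

With 668522 rows, head() can look perfectly numeric while the problem values sit much further down the file.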
