ValueError: Found unknown categories [nan] in column 2 during fit

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.tree import DecisionTreeClassifier

path = r"C:\Users\thund\Downloads\Boat.csv"
data = pd.read_csv(path)  # pip install xlrd

print(data.shape)

print(data.columns)

print(data.isnull().sum())
print (data.dropna(axis=0))  #dropping rows that have missing values

print (data['Class'].value_counts())

print(data['Class'].value_counts().plot(kind = 'bar'))
#plt.show()

data['safety'].value_counts().plot(kind = 'bar')
#plt.show()


import seaborn as sns
sns.countplot(data['demand'], hue = data['Class'])
#plt.show()

X = data.drop(['Class'], axis = 1)
y = data['Class']

from sklearn.preprocessing import OrdinalEncoder
demand_category = ['low', 'med', 'high', 'vhigh']
maint_category = ['low', 'med', 'high', 'vhigh']
seats_category = ['2', '3', '4', '5more']
passenger_category = ['2', '4', 'more']
storage_category = ['Nostorage', 'small', 'med']
safety_category = ['poor', 'good', 'vgood']
all_categories = [demand_category, maint_category,seats_category,passenger_category,storage_category,safety_category]


oe = OrdinalEncoder(categories= all_categories)
X = oe.fit_transform( data[['demand','maint', 'seats', 'passenger', 'storage', 'safety']])

Dataset: https://drive.google.com/file/d/1O0sYZGJep4JkrSgGeJc5e_Nlao2bmegV/view?usp=sharing

With the code above I keep getting 'ValueError: Found unknown categories [nan] in column 2 during fit'. I have tried dropping all missing values. I searched for a fix and found a suggestion to use handle_unknown="ignore", but I don't think that works for ordinal encoding. I am fairly new to Python, so I would deeply appreciate an in-depth explanation of why this is happening and how I can fix it.

PS: This is for pre-processing the data.

CodePudding user response：

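To explain the error: data.dropna(axis=0) does not change data. It returns a new DataFrame with the NaN rows removed, and you only printed that result, so the original data still contains the missing values.

The error says "column 2". That is the zero-based position among the columns you passed to fit_transform, which makes it "seats", so there is a NaN value in the "seats" column.

When you print data['seats'].unique(), you get something like this: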

['2' '3' '4' '5more' nan]
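
To see this for yourself, here is a minimal sketch. It assumes a tiny made-up DataFrame in place of Boat.csv (the values are hypothetical, only the 'seats' column name comes from the question) and shows that calling dropna() without keeping the result leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Boat.csv; only the 'seats' column matters here.
data = pd.DataFrame({"seats": ["2", "3", np.nan, "5more"]})

# dropna() returns a NEW DataFrame; `data` itself is not modified.
dropped = data.dropna(axis=0)

rows_before = len(data)      # row count of the original, NaN row included
rows_after = len(dropped)    # row count of the returned copy, NaN row removed
nan_still_present = bool(data["seats"].isnull().any())

print(rows_before, rows_after, nan_still_present)
print(data["seats"].unique())  # the original still lists nan among its values
```

If nan_still_present is True after your "drop", the encoder will still see the missing value during fit.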

There are two solutions:

Using inplace:
```
data.dropna(inplace=True)
```
This modifies the original DataFrame directly instead of returning a new one.

Manually assigning:
```
data = data.dropna()
```
This has the same effect as 'inplace'. Any difference in efficiency is usually negligible, and the assignment makes it more obvious what is happening.
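
Putting it together, here is a rough sketch of the fix followed by the encoding step. The sample rows are made up (the real CSV is not reproduced here) and only two of the six columns are used to keep it short; the category lists are the ones from the question:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows standing in for Boat.csv.
data = pd.DataFrame({
    "seats": ["2", "3", np.nan, "5more", "4"],
    "safety": ["poor", "good", "vgood", "good", "poor"],
})

# Category orderings taken from the question.
seats_category = ["2", "3", "4", "5more"]
safety_category = ["poor", "good", "vgood"]

# Assign the result back so the NaN row is really gone.
data = data.dropna()

# With no NaN left, every value is in the declared categories and fit succeeds.
oe = OrdinalEncoder(categories=[seats_category, safety_category])
X = oe.fit_transform(data[["seats", "safety"]])

print(X)
```

In your own script, do the dropna before you create X and y, so that both are built from the same cleaned rows.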

Hope this answers your question.