How can we convert numerical data to labeled data and make a prediction?

I understand how to encode labeled data into numerical data, using any of several techniques, including One-hot Encoding, Label Encoding, Ordinal Encoding, etc. I am wondering how to convert the numerical data back into labeled data. Here's a simple example.

# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Step 1: Create data set


# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

df.columns

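# Replace every column (numeric ones included) with integer category codes; missing values become -1
df_fin = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)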
df_fin


X = df_fin[['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm']]
y = df_fin['city_mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Fit a Decision Tree model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Now, how can I predict the target variable (dependent variable) from the independent variables?

Something like this should work, I think, but it doesn't:

clf.predict([[2,164,'audi','gas','std','four','sedan','fwd','front',99.8,176.6,66.2,54.3,2337,'ohc','four',109,'mpfi',3.19,3.4,10,102,5500,24,30,13950,]])

If we leave the numeric values as they are and put quotes around the labels, I would expect to be able to predict the dependent variable, but I can't, because of the labeled data. If the data were all numeric, and this were a regression problem, it would work. My question is: how can we convert the categorical codes back into the original labels, and how can we make a prediction from an input that still contains labels?

User response:

The input data you want to use to predict the target variable needs to be in the same format as the data used for training the model.
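
If you want to stay with the cat.codes approach from the question, one way to honor this is to keep the category order that produced the training codes and reuse it for new rows. The following is only a minimal sketch: the tiny DataFrame, its values, and the helper names encode and decode are made up for illustration, and only the text columns are encoded here (the question's code also converts the numeric columns to codes, which would need the same treatment).

```python
import pandas as pd

# Small stand-in for the training frame (hypothetical values)
train = pd.DataFrame({
    "make": ["audi", "bmw", "bmw", "toyota"],
    "fuel_type": ["gas", "gas", "diesel", "gas"],
    "horsepower": [102, 121, 110, 62],
})
categorical_cols = ["make", "fuel_type"]

# Remember the category order that produced the training codes
categories = {col: train[col].astype("category").cat.categories
              for col in categorical_cols}

def encode(frame):
    """Apply the training-time label-to-code mapping (hypothetical helper)."""
    out = frame.copy()
    for col, cats in categories.items():
        # Labels that were not seen during training get the code -1
        out[col] = pd.Categorical(frame[col], categories=cats).codes
    return out

def decode(frame):
    """Map integer codes back to the original labels (hypothetical helper)."""
    out = frame.copy()
    for col, cats in categories.items():
        out[col] = pd.Categorical.from_codes(frame[col], categories=cats)
    return out

train_enc = encode(train)

# A new row written with raw labels, encoded the same way as the training data
new_row = pd.DataFrame({"make": ["bmw"], "fuel_type": ["diesel"], "horsepower": [115]})
new_enc = encode(new_row)
new_dec = decode(new_enc)
print(new_enc)
print(new_dec)
```

The encoded row can then be passed to clf.predict, provided it has exactly the columns the model was trained on, in the same order.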

I recommend encoding categorical data with a scikit-learn preprocessor such as OneHotEncoder (for one-hot encoding; there are also OrdinalEncoder for features and LabelEncoder for target labels, among others). You first fit() the preprocessor on your categorical training data, and can then use the same fitted object to transform() the data you want to predict on.
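
One way to make sure the same fitted encoder is applied at training and prediction time is to put it in a Pipeline together with the model. This is a sketch under assumptions, not the asker's actual setup: the toy data, the mpg_band target, and the column choices are invented for illustration. ColumnTransformer, Pipeline, and OneHotEncoder(handle_unknown="ignore") are standard scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the car data (hypothetical values and target)
train = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "toyota", "toyota"],
    "fuel_type": ["gas", "gas", "gas", "diesel", "gas", "gas"],
    "horsepower": [102, 115, 121, 110, 62, 70],
    "mpg_band": ["mid", "mid", "low", "mid", "high", "high"],
})
feature_cols = ["make", "fuel_type", "horsepower"]
categorical_cols = ["make", "fuel_type"]

# One-hot encode the text columns; pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

# The pipeline fits the encoder and the tree together
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(train[feature_cols], train["mpg_band"])

# Predict directly from raw labels; the pipeline encodes them internally
new_car = pd.DataFrame([{"make": "audi", "fuel_type": "gas", "horsepower": 102}])
prediction = model.predict(new_car)
print(prediction)
```

With this layout the input to predict keeps its quoted labels, which is what the question was trying to do; handle_unknown="ignore" encodes a label that was not seen during training as all zeros instead of raising an error.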

Example using one hot encoding:

import pandas as pd

from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"car_make": ["audi", "bmw", "bmw", "renault"],
                   "car_country": ["DE", "DE", "DE", "FR"],
                   "car_age": [1, 3, 1, 5]})

categorical_cols = ["car_make", "car_country"]
enc = OneHotEncoder()
enc.fit(df[categorical_cols]) # fitting the transformer on our categorical data

X_enc = enc.transform(df[categorical_cols]).toarray() # this returns a numpy array with the encoded values

You can use the get_feature_names_out() method on your fitted encoder to get an array of column names. Example building on the above:

df_encoded = pd.DataFrame(X_enc, columns=enc.get_feature_names_out())
print(df_encoded)
   car_make_audi  car_make_bmw  car_make_renault  car_country_DE  car_country_FR
0            1.0           0.0               0.0             1.0             0.0
1            0.0           1.0               0.0             1.0             0.0
2            0.0           1.0               0.0             1.0             0.0
3            0.0           0.0               1.0             0.0             1.0

# getting our original values:
df_orig = enc.inverse_transform(X_enc)
print(df_orig)
[['audi' 'DE']
 ['bmw' 'DE']
 ['bmw' 'DE']
 ['renault' 'FR']]

As the last step above shows, calling inverse_transform on the encoded data returns the original values.
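
The same idea applies to the target column: if the labels were turned into integer codes before training, the predicted codes can be mapped back afterwards. Below is a small sketch using LabelEncoder; the horsepower values and the low/mid/high labels are hypothetical. Note that scikit-learn classifiers also accept string labels directly, so this step is only needed when the target has already been encoded.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical single-feature data (e.g. horsepower) with string targets
X = np.array([[102], [121], [62], [70]])
y_labels = np.array(["mid", "low", "high", "high"])

# Encode the target; classes_ stores the labels in sorted order
target_enc = LabelEncoder()
y_codes = target_enc.fit_transform(y_labels)

clf = DecisionTreeClassifier(random_state=0).fit(X, y_codes)

# The model predicts integer codes; inverse_transform recovers the labels
pred_codes = clf.predict([[62]])
pred_labels = target_enc.inverse_transform(pred_codes)
print(pred_codes, pred_labels)
```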

I recommend looking at the docs for more details and use cases: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

Using sklearn preprocessors will save you a lot of trouble down the road!