How can we use a Classifier to Make a Prediction based on Numeric Data and Labeled Data?

I am trying to train and fit a classifier, and then use it to make a prediction, based on a combination of numeric data and labeled data.

I am trying to predict the price of a vehicle, based on these prediction variables.

prediction_values = [2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd', 'front', 99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four', 109, 'mpfi', 3.19, 3.4, 10, 102, 5500, 30]

Here is my code.

# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Step 1: Create the data set
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

df.columns

df_fin = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
df_fin


X = df_fin[["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg"]]

y = df_fin["price"]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Decision Tree model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)


# create a map of your columns values with the corresponding categorical values
col_dictionary = {}
for col in df:
    dictionary = dict(enumerate(df[col].astype('category').cat.categories))
    col_dictionary[col] = {v: k for k, v in dictionary.items()}


# then use this map to convert the array you want to predict
prediction_values = [2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd', 'front', 99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four', 109, 'mpfi', 3.19, 3.4, 10, 102, 5500, 30]
to_predict = []
for (column, value) in zip(X.columns, prediction_values):
    to_predict.append(col_dictionary[column][value])
to_predict_df = pd.DataFrame([to_predict], columns=X.columns)
clf.predict([to_predict_df.iloc[0].values])

When I run the code, I get this error.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  Input In [101] in <cell line: 5>
    to_predict_df = pd.DataFrame([to_predict], columns=X.columns)

  File ~\AppData\Roaming\Python\Python39\site-packages\pandas\core\frame.py:570 in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)

  File ~\AppData\Roaming\Python\Python39\site-packages\pandas\core\internals\construction.py:528 in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)

  File ~\AppData\Roaming\Python\Python39\site-packages\pandas\core\internals\construction.py:571 in _list_to_arrays
    raise ValueError(e) from e

ValueError: 25 columns passed, passed data had 24 columns

CodePudding user response:

There is nothing wrong with the classifier. A quick check shows that the problem is the prediction_values list: it is missing a value.

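Its length is 24, while X.columns has a length of 25. The zip loop stops silently at the shorter of the two sequences, so to_predict also ends up with 24 values, and pd.DataFrame raises the ValueError when it is given 25 column names for a 24-value row.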
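The quick check can be sketched as follows, with the column names and values copied from the question. The names feature_columns, pairs and unpaired are introduced here for illustration only. The sketch also shows why the loop itself never complains: zip() stops at the shorter sequence.

```python
# Feature names copied from the question (every header except "price").
feature_columns = [
    "symboling", "normalized_losses", "make", "fuel_type", "aspiration",
    "num_doors", "body_style", "drive_wheels", "engine_location",
    "wheel_base", "length", "width", "height", "curb_weight",
    "engine_type", "num_cylinders", "engine_size", "fuel_system",
    "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
    "city_mpg", "highway_mpg",
]

# The list from the question, unchanged.
prediction_values = [2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd',
                     'front', 99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four',
                     109, 'mpfi', 3.19, 3.4, 10, 102, 5500, 30]

# Compare the two lengths before building anything from them.
print(len(feature_columns), len(prediction_values))

# zip() pairs items positionally and stops at the shorter input,
# so the trailing column name is dropped without any error.
pairs = list(zip(feature_columns, prediction_values))
unpaired = feature_columns[len(pairs):]
print(unpaired)
```

The unpaired name is simply the last column. It does not tell you which value was actually left out, so compare the list against the columns position by position before adding the missing entry.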

Once you add the missing value to prediction_values, so that it has 25 entries in the same order as X.columns, this error goes away.
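One way to make this kind of slip harder to repeat is to write the input as a dict keyed by column name and validate it before use. This is only a sketch under that assumption: build_row is a hypothetical helper, not part of pandas or scikit-learn, and the three-column example is shortened for readability.

```python
def build_row(feature_columns, named_values):
    """Return the values ordered like feature_columns.

    Raises ValueError naming any column that is missing or unexpected,
    instead of silently shifting values into the wrong columns.
    """
    missing = [c for c in feature_columns if c not in named_values]
    extra = [k for k in named_values if k not in feature_columns]
    if missing or extra:
        raise ValueError(f"missing: {missing}, unexpected: {extra}")
    return [named_values[c] for c in feature_columns]


# Shortened, illustrative column list (the real one has 25 names).
columns = ["make", "fuel_type", "horsepower"]

# The order of the dict keys does not matter; the output follows `columns`.
row = build_row(columns, {"horsepower": 102, "make": "audi", "fuel_type": "gas"})
print(row)

# Leaving a column out now fails immediately with a message that names it.
try:
    build_row(columns, {"make": "audi", "horsepower": 102})
except ValueError as err:
    message = str(err)
    print(message)
```

The list returned by build_row can then go through the same per-column encoding loop as in the question.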