How do I perform prediction algorithms on survey data?

For example, I have a dataset like this:

If I have an input like "Son 19-30 Read the book No", how do I get a prediction for online shopping from it? What kind of machine learning approach should I consider?

CodePudding user response：

You can use any classification algorithm for this problem, but first you have to preprocess the data. For Gender, you can use 0/1: 1 for son and 0 for girl. For Age, you can use category codes 0-2 for the different age groups and pass them through a one-hot encoder. For Leisure, you have to use an NLP technique to extract keywords from the sentences, such as Games, Read Books, Internet, Music. These keywords can then be one-hot encoded as well.

After that, you can apply any classification algorithm.
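
The preprocessing steps above could be sketched roughly as follows. The rows and the keyword list are made up for illustration (the real survey table is not shown here), and the substring match is only a crude stand-in for a proper NLP keyword extraction step.

```python
import pandas as pd

# Made-up rows for illustration only; the real survey table is not shown here.
df = pd.DataFrame({
    "Gender": ["Son", "Girl", "Son"],
    "Age": ["0-18", "19-30", "31-45"],
    "Leisure": ["Play games and listen to music", "Read the book", "Surf the internet"],
})

# Gender as a single 0/1 column: 1 for son, 0 for girl.
df["Gender"] = (df["Gender"] == "Son").astype(int)

# Age groups as one-hot indicator columns (Age_0-18, Age_19-30, ...).
age_dummies = pd.get_dummies(df.pop("Age"), prefix="Age", dtype=int)
df = df.join(age_dummies)

# Very simple keyword spotting in the free-text Leisure answers.
# The keyword list is an assumption; a real survey needs its own list.
keywords = {"Games": "game", "Books": "book", "Internet": "internet", "Music": "music"}
leisure = df.pop("Leisure").str.lower()
for name, needle in keywords.items():
    df["Leisure_" + name] = leisure.str.contains(needle, regex=False).astype(int)

print(df)
```

A row can switch on several Leisure columns at once, which is why the keywords are encoded as separate indicators rather than as one label.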

CodePudding user response：

One of the most important aspects of ML is data. Your dataset is very small. It is also a binary classification problem with two classes, and there is just one sample of class No, which will mostly make your model tend to predict Yes.

So I recommend finding more data first. Then, for prediction, you can use Decision Trees, Support Vector Machines, or Logistic Regression.
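
As a rough sketch of that advice, the snippet below first counts the samples per class and then fits the three suggested model types. The data is synthetic and stands in for an already-encoded survey. The `class_weight="balanced"` setting is my own addition for the imbalance; it is not part of the recommendation above and does not replace collecting more data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an already-encoded survey: 40 rows, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 4))
y = np.array([1] * 32 + [0] * 8)  # 1 = Yes, 0 = No (deliberately imbalanced)

# Check the class balance before trusting any accuracy figure.
class_counts = np.bincount(y)
print("samples per class (No, Yes):", class_counts)

# class_weight="balanced" reweights the rare class; an assumption, not the answer's method.
models = {
    "decision tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "svm": SVC(class_weight="balanced"),
    "logistic regression": LogisticRegression(class_weight="balanced", max_iter=1000),
}
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
```

With only one No sample, as in the original table, none of these models has enough information to learn the minority class, whatever the weighting.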

CodePudding user response：

OK, let's do it. First, we need to preprocess the data.

import pandas as pd
from sklearn import preprocessing #for data preprocessing

df = ... #it's var for your table

We need to turn Gender into numeric indicator columns. pd.get_dummies creates one 0/1 column per category value:

new_gender_columns = pd.get_dummies(df['Gender'])

Then add new_gender_columns to your DataFrame

df = df.join(new_gender_columns)

Then we need to delete the original Gender column, which still holds text values:

df.drop('Gender', axis=1, inplace=True)

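We need to do the same for the Married column. We could repeat the previous code for Married, or define a function that handles all object columns. If you use the function, run it instead of the three Gender statements above, because it expects the Gender column to still exist.

def df_dummies(df, columns: list):
    # replace each listed column with its 0/1 indicator columns
    for col in columns:
        # prefix=col keeps the new column names unique across columns
        new_dummies_columns = pd.get_dummies(df[col], prefix=col)
        df = df.join(new_dummies_columns)
        df = df.drop(col, axis=1)
    return df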

Now, make a list of the object columns:

obj_cols = ['Gender', 'Married']

Run the function:

df = df_dummies(df, obj_cols)

Then we have columns that can be processed with a label encoder, which maps each category to an integer. For example, Age:

0-18 = label 0

19-30 = label 1

etc.

Let's make a list of those columns and encode them:

labels_cols = ['Age', 'Leisure', 'Online Shopping']
le = preprocessing.LabelEncoder()
for col in labels_cols:
    # fit on this column's values, then replace them with integer labels
    df[col] = le.fit_transform(df[col])
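
One thing the loop above does not keep is the fitted mapping for each column, which is needed later to encode a new input or decode a prediction. A minimal sketch of one way to do that, using made-up values and a hypothetical `encoders` dict, is:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration; the real table is not shown.
df = pd.DataFrame({
    "Age": ["19-30", "0-18", "31-45", "19-30"],
    "Online Shopping": ["Yes", "No", "Yes", "Yes"],
})

# One fitted encoder per column, kept in a dict for later reuse.
encoders = {}
for col in ["Age", "Online Shopping"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# LabelEncoder numbers the categories in sorted order.
age_classes = list(encoders["Age"].classes_)

# Encode a new answer with the same mapping, and decode a prediction back to text.
new_age = encoders["Age"].transform(["19-30"])[0]
label = encoders["Online Shopping"].inverse_transform([1])[0]
print(age_classes, new_age, label)
```

Because the numbering follows string sort order, it only matches the natural order of the age groups when the group names happen to sort that way.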

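Now, machine learning. Let's import a random forest classifier, because we need to predict classes ('yes' or 'no', 1 or 0).

from sklearn.ensemble import RandomForestClassifier

# the dummy columns were appended after 'Online Shopping', so select the target by name
X = df.drop('Online Shopping', axis=1)
y = df['Online Shopping']
rfc = RandomForestClassifier()
rfc.fit(X, y)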

OK, the model is fitted and you are ready to test it. Have a safe journey, young padawan.
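
For the input from the question ("Son 19-30 Read the book No"), a prediction could look roughly like the sketch below. It swaps the manual steps for a scikit-learn pipeline with OneHotEncoder so that the new row is encoded exactly like the training rows. The training rows are made up, and the column names are taken from the code above, so treat it as an illustration under those assumptions rather than a finished solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows; replace with the real survey table.
train = pd.DataFrame({
    "Gender": ["Son", "Girl", "Son", "Girl", "Son", "Girl"],
    "Age": ["19-30", "0-18", "31-45", "19-30", "0-18", "31-45"],
    "Leisure": ["Read the book", "Play games", "Listen to music",
                "Surf the internet", "Play games", "Read the book"],
    "Married": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Online Shopping": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
})
features = ["Gender", "Age", "Leisure", "Married"]

# One-hot encode every feature; categories unseen in training become all zeros.
encode = ColumnTransformer([("onehot", OneHotEncoder(handle_unknown="ignore"), features)])
model = make_pipeline(encode, RandomForestClassifier(random_state=0))
model.fit(train[features], train["Online Shopping"])

# The question's input: "Son 19-30 Read the book No".
new_row = pd.DataFrame([["Son", "19-30", "Read the book", "No"]], columns=features)
prediction = model.predict(new_row)[0]
print("Predicted Online Shopping:", prediction)
```

Keeping the encoder inside the pipeline means the same fitted mapping is reused at prediction time, so the new row never has to be encoded by hand.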