How to use machine learning in Python to predict a binary outcome with a pandas DataFrame

I have the following code:

import nfl_data_py as nfl
pbp = nfl.import_pbp_data([2022], downcast=True, cache=False, alt_path=None)

which returns a DataFrame of every play from the 2022 NFL season. I want to train a model on the columns score_differential, yardline_100, ydstogo, down and half_seconds_remaining to predict play_type, which should be either run or pass.

Example: if I feed it a score differential of -4, the 25-yard line, 4th down, 16 yards to go, and 300 seconds remaining in the half, it should return whatever it learned from the DataFrame, probably pass.

How would I go about doing this? Should I use a scikit-learn decision tree?

CodePudding user response:

Here you go:

import nfl_data_py as nfl
import pandas as pd
#import train_test_split
from sklearn.model_selection import train_test_split
#we need to encode the play_type column
from sklearn.preprocessing import LabelEncoder 
#import the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


pbp = nfl.import_pbp_data([2022], downcast=True, cache=False, alt_path=None)
df = pd.DataFrame(pbp)
#there are definitely other features you can use, but these are the ones you want.
df = df[['score_differential', 'yardline_100', 'ydstogo', 'down', 'half_seconds_remaining', 'play_type']]
df = df.dropna()
#keep only run and pass plays so the target is binary (other play types such as punt, field_goal or no_play are dropped)
df = df[df['play_type'].isin(['run', 'pass'])]
#reset the index
df = df.reset_index(drop=True)
#encode the play_type column
le = LabelEncoder()
df['play_type_encode'] = le.fit_transform(df['play_type'])
#separate the features from the encoded target
X = df.drop(['play_type', 'play_type_encode'], axis=1)
y = df['play_type_encode']
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#instantiate the model (random_state makes the results reproducible)
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
#fit the model
rfc.fit(X_train, y_train)
#predict on the test set
rfc_pred = rfc.predict(X_test)
#evaluate the model
print(classification_report(y_test, rfc_pred))
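#plot the confusion matrix (fmt='d' prints the counts as whole numbers)
plt.figure(figsize=(10,6))
sns.heatmap(confusion_matrix(y_test, rfc_pred), annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
#display the figure when running as a script
plt.show()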
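
To get a prediction for a single situation like the one in the question (-4 score differential, 25 yard line, 4th down, 16 to go, 300 seconds left in the half), you can pass a one-row DataFrame with the same columns to predict and decode the result with the LabelEncoder. This is only a minimal sketch: to keep it runnable without downloading anything, it trains on a small synthetic table with a made-up labelling rule rather than the real play-by-play data, and the names toy, FEATURES and situation are just illustrative. With the real data you would reuse the rfc and le fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

FEATURES = ['score_differential', 'yardline_100', 'ydstogo', 'down', 'half_seconds_remaining']

# Toy stand-in for the cleaned play-by-play frame (synthetic, NOT real NFL data).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'score_differential': rng.integers(-21, 22, n),
    'yardline_100': rng.integers(1, 100, n),
    'ydstogo': rng.integers(1, 21, n),
    'down': rng.integers(1, 5, n),
    'half_seconds_remaining': rng.integers(0, 1801, n),
})
# Made-up labelling rule so the toy data has some signal: long yardage means pass.
toy['play_type'] = np.where(toy['ydstogo'] >= 8, 'pass', 'run')

# Encode the target and fit the model, same steps as in the answer above.
le = LabelEncoder()
y = le.fit_transform(toy['play_type'])
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(toy[FEATURES], y)

# One game situation, using the same column names and order as the training data.
situation = pd.DataFrame([{
    'score_differential': -4,
    'yardline_100': 25,
    'ydstogo': 16,
    'down': 4,
    'half_seconds_remaining': 300,
}])[FEATURES]

# predict returns the encoded class; inverse_transform maps it back to 'run' or 'pass'.
predicted_label = str(le.inverse_transform(rfc.predict(situation))[0])
# predict_proba gives one probability per class, in the order of le.classes_.
probabilities = {str(c): float(p) for c, p in zip(le.classes_, rfc.predict_proba(situation)[0])}
print(predicted_label, probabilities)
```

predict_proba is often more useful than the bare label here, since it shows how confident the forest is about run versus pass in that situation.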