I have a few data structures returned from a RandomForestClassifier() and from encoding string data from a CSV. I am predicting the probability of certain crimes happening given some weather data. The model part works well, but I'm a bit of a Python newbie and can't wrap my head around merging this data.
Here's a simplified version of what I have:
#this line is pseudo code
data = from_csv_file
label_dict = { 'Assault': 0, 'Robbery': 1 }
# index 0 of each cell in predictions is Assault, index 1 is Robbery
encoded_labels = [0, 1]
# Probabilities of crime being assault or robbery
predictions = [
[0.4, 0.6],
[0.1, 0.9],
[0.8, 0.2],
...
]
I'd like to add a new column to data for each crime label, with the cell contents being the probability, e.g. new columns called prob_Assault and prob_Robbery. Eventually I'd like to add a boolean column (True/False) that shows if the prediction was correct or not.
How could I go about this? Using Python 3.10, pandas, numpy, and scikit-learn.
EDIT: It might be easier for some if you saw the important part of my actual code:
# Training data X, Y
tr_y = tr['offence']
tr_x = tr.drop('offence', axis=1)
# Test X (what to predict)
test_x = test_data.drop('offence', axis=1)
clf = RandomForestClassifier(n_estimators=40)
fitted = clf.fit(tr_x, tr_y)
pred = clf.predict_proba(test_x)
encoded_labels = fitted.classes_
# I also have the encodings dictionary that shows the encodings for crime types
CodePudding user response:
You are on the right track. What you need is to convert predictions from a list to a NumPy array and then access its columns:
import numpy as np
predictions = np.array(predictions)
data["prob_Assault"] = predictions[:,0]
data["prob_Robbery"] = predictions[:,1]
I am assuming that data is a pandas DataFrame. I am not sure how you want to evaluate these probabilities, but you can use logical comparisons in pandas as well:
data["prob_Assault"] == 0.8 # For example, 0.8 is the correct probability
The code above will return a boolean Series such as:
0 True
1 False
2 False
...
You can assign these values to the dataframe as a new column:
data["check"] = data["prob_Assault"] == 0.8
Or even select the True rows of the dataframe:
data[data["prob_Assault"] == 0.8]
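For the "was the prediction correct" column asked about in the question, comparing a probability against a fixed value is probably not what is wanted. A minimal sketch of one alternative follows: take the most probable class per row and compare it with the actual label. It assumes that data still holds the true label as a string in an offence column (the column name comes from the question's EDIT, but whether it is stored as text or as an encoded integer is a guess) and that the columns of predictions follow the indices in label_dict. The toy values and the predicted and correct column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's objects (values are illustrative only).
label_dict = {"Assault": 0, "Robbery": 1}
data = pd.DataFrame({"offence": ["Robbery", "Assault", "Assault"]})
predictions = np.array([[0.4, 0.6], [0.1, 0.9], [0.8, 0.2]])

# Invert the encoding so that a column index maps back to a crime name.
index_to_label = {idx: name for name, idx in label_dict.items()}

# One probability column per crime label.
for name, idx in label_dict.items():
    data[f"prob_{name}"] = predictions[:, idx]

# Most probable class per row, mapped back to its name.
data["predicted"] = [index_to_label[int(i)] for i in predictions.argmax(axis=1)]

# True where the most probable class matches the actual label.
data["correct"] = data["predicted"] == data["offence"]
```

If offence is stored as encoded integers instead, the comparison could be made directly against predictions.argmax(axis=1), provided the encoding matches the column order.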
CodePudding user response:
Maybe I misunderstood your problem, but if not, this could be a solution:
Create a dataframe with two columns, prob_Assault and prob_Robbery:
import pandas as pd
predictions_df = pd.DataFrame(predictions, columns=['prob_Assault', 'prob_Robbery'])
Then join predictions_df to your data.
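The join step can be sketched roughly as follows. The one detail to watch is the index: a DataFrame built from a plain list gets a fresh 0..n-1 index, so if data has a different index (after filtering rows or a train/test split, say), the probabilities would not line up with their rows. Passing index=data.index avoids that. The toy data frame, its temp column, and its index values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the question's data; note the non-default index.
data = pd.DataFrame({"temp": [12.0, 18.5, 7.3]}, index=[10, 11, 12])
predictions = [[0.4, 0.6], [0.1, 0.9], [0.8, 0.2]]

# Reuse data's index so each probability row lines up with its source row.
predictions_df = pd.DataFrame(
    predictions,
    columns=["prob_Assault", "prob_Robbery"],
    index=data.index,
)

# join aligns on the index and returns a new DataFrame.
data = data.join(predictions_df)
```

With the fitted classifier from the question, the column names could also be derived from clf.classes_ (for example [f"prob_{c}" for c in clf.classes_]), since the columns returned by predict_proba follow that order.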