Merging some list and dictionary data into one ndarray

I have a few data structures returned from a RandomForestClassifier() and from encoding string data from a CSV. I am predicting the probability of certain crimes happening given some weather data. The model part works well but I'm a bit of a Python nooby and can't wrap my head around merging this data.

Here's a dumbed down version of what I have:

#this line is pseudo code
data = from_csv_file

label_dict = { 'Assault': 0, 'Robbery': 1 }

# index 0 of each cell in predictions is Assault, index 1 is Robbery
encoded_labels = [0, 1]

# Probabilities of crime being assault or robbery
predictions = [
               [0.4, 0.6], 
               [0.1, 0.9], 
               [0.8, 0.2], 
               ...
              ]

I'd like to add a new column to data for each crime label with the cell contents being the probability, e.g. new columns called prob_Assault and prob_Robbery. Eventually I'd like to add a boolean column (True/False) that shows if the prediction was correct or not.

How could I go about this? Using Python 3.10, pandas, numpy, and scikit-learn.

EDIT: Might be easier for some if you saw the important part of my actual code

from sklearn.ensemble import RandomForestClassifier

# Training data X, Y
tr_y = tr['offence']
tr_x = tr.drop('offence', axis=1)

# Test X (what to predict)
test_x = test_data.drop('offence', axis=1)

clf = RandomForestClassifier(n_estimators=40)
fitted = clf.fit(tr_x, tr_y)
pred = clf.predict_proba(test_x)
encoded_labels = fitted.classes_

# I also have the encodings dictionary that shows the encodings for crime types

CodePudding user response:

You are on the right track. Convert predictions from a list to a NumPy array, then select its columns:

import numpy as np
predictions = np.array(predictions)
data["prob_Assault"] = predictions[:,0]
data["prob_Robbery"] = predictions[:,1]
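If you would rather not hard-code the column positions, here is a minimal sketch that builds the column names from label_dict and encoded_labels. The toy data frame, the "temp" column, and the code_to_name helper are assumptions made up for illustration; encoded_labels stands in for clf.classes_ from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the objects in the question (values are illustrative only).
data = pd.DataFrame({"temp": [12.0, 18.5, 7.3]})
label_dict = {"Assault": 0, "Robbery": 1}
encoded_labels = np.array([0, 1])  # plays the role of clf.classes_
predictions = np.array([[0.4, 0.6],
                        [0.1, 0.9],
                        [0.8, 0.2]])

# Invert the encoding so each integer code maps back to its crime name.
code_to_name = {code: name for name, code in label_dict.items()}

# Column j of predictions holds the probability of class encoded_labels[j].
for j, code in enumerate(encoded_labels):
    data[f"prob_{code_to_name[int(code)]}"] = predictions[:, j]

print(data)
```

This way the names follow whatever order the classifier reports, so adding a third crime type should not require touching the loop.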

I am assuming that data is a pandas DataFrame. I am not sure how you want to evaluate these probabilities, but you can use boolean comparisons in pandas as well:

data["prob_Assault"] == 0.8 # For example, 0.8 is the correct probability

The code above returns a boolean Series such as:

0    False
1    False
2     True
...

You can assign these values to the dataframe as a new column:

data["check"] = data["prob_Assault"] == 0.8

Or even select the True rows of the dataframe:

data[data["prob_Assault"] == 0.8]
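For the "was the prediction correct" column from the question, one possible approach is to take the most probable class per row and compare it with the true label. This is only a sketch: it assumes data still has an encoded "offence" column (as test_data does in the question), and the values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the "offence" values are invented for this example.
data = pd.DataFrame({"offence": [1, 0, 0]})
encoded_labels = np.array([0, 1])  # plays the role of clf.classes_
predictions = np.array([[0.4, 0.6],
                        [0.1, 0.9],
                        [0.8, 0.2]])

# Index of the highest probability in each row, mapped back to its encoded label.
data["predicted"] = encoded_labels[predictions.argmax(axis=1)]

# True where the most probable class matches the actual label.
data["correct"] = data["predicted"] == data["offence"]

print(data)
```

If you do need to compare the probabilities themselves against a value, np.isclose is usually safer than == for floats.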

CodePudding user response:

Maybe I misunderstood your problem, but if not, this could be a solution:

Create a DataFrame with two columns, prob_Assault and prob_Robbery:

import pandas as pd
predictions_df = pd.DataFrame(predictions, columns=['prob_Assault', 'prob_Robbery'])

Then join predictions_df to your data.
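A minimal sketch of the join step, assuming the rows of predictions are in the same order as the rows of data. The toy frame, its "temp" column, and its non-default index are made up for illustration. Passing index=data.index makes the two frames line up even when data does not have a plain 0..n-1 index.

```python
import pandas as pd

# Toy stand-in for data, with a non-default index on purpose.
data = pd.DataFrame({"temp": [12.0, 18.5, 7.3]}, index=[10, 11, 12])
predictions = [[0.4, 0.6],
               [0.1, 0.9],
               [0.8, 0.2]]

# Reuse data's index so that the join matches rows one to one.
predictions_df = pd.DataFrame(
    predictions,
    columns=["prob_Assault", "prob_Robbery"],
    index=data.index,
)

# DataFrame.join aligns on the index and appends the new columns.
merged = data.join(predictions_df)

print(merged)
```

Without the shared index, join would align on mismatched labels and fill the probability columns with NaN.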