Merging some list and dictionary data into one ndarray

Time:03-19

I have a few data structures returned from a RandomForestClassifier() and from encoding string data from a CSV. I am predicting the probability of certain crimes happening given some weather data. The model part works well but I'm a bit of a Python nooby and can't wrap my head around merging this data.

Here's a dumbed down version of what I have:

#this line is pseudo code
data = from_csv_file

label_dict = { 'Assault': 0, 'Robbery': 1 }

# index 0 of each cell in predictions is Assault, index 1 is Robbery
encoded_labels = [0, 1]

# Probabilities of crime being assault or robbery
predictions = [
               [0.4, 0.6], 
               [0.1, 0.9], 
               [0.8, 0.2], 
               ...
              ]

I'd like to add a new column to data for each crime label with the cell contents being the probability, e.g. new columns called prob_Assault and prob_Robbery. Eventually I'd like to add a boolean column (True/False) that shows if the prediction was correct or not.

How could I go about this? Using Python 3.10, pandas, numpy, and scikit-learn.

EDIT: Might be easier for some if you saw the important part of my actual code

from sklearn.ensemble import RandomForestClassifier

# Training data X, Y
tr_y = tr['offence']
tr_x = tr.drop('offence', axis=1)

# Test X (what to predict)
test_x = test_data.drop('offence', axis=1)

clf = RandomForestClassifier(n_estimators=40)
fitted = clf.fit(tr_x, tr_y)
pred = clf.predict_proba(test_x)
encoded_labels = fitted.classes_

# I also have the encodings dictionary that shows the encodings for crime types

CodePudding user response:

You are on the right track. You need to convert the predictions from a list to a NumPy array and then access its columns:

import numpy as np
predictions = np.array(predictions)
data["prob_Assault"] = predictions[:,0]
data["prob_Robbery"] = predictions[:,1]

I am assuming that data is a pandas DataFrame. I am not sure how you want to evaluate these probabilities, but you can use comparison expressions in pandas as well (for floating-point values, np.isclose is safer than an exact == comparison):

data["prob_Assault"] == 0.8 # For example, 0.8 is the correct probability

The code above will return a boolean Series such as:

0     True
1    False
2    False
...

You can assign this Series to the DataFrame as a new column:

data["check"] = data["prob_Assault"] == 0.8

Or even select only the rows of the DataFrame where the condition is True:

data[data["prob_Assault"] == 0.8]
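
For the correct/incorrect column the question asks about, one possible sketch is to take the most probable class per row and compare it with the true label. It assumes the test frame still holds the true offence column with encoded labels, and it relies on the columns of predict_proba following the order of classes_. The toy values below are made up for illustration; in the question they would be test_data, pred and fitted.classes_.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions for illustration only).
data = pd.DataFrame({"offence": [1, 0, 0]})           # true encoded labels
predictions = np.array([[0.4, 0.6], [0.1, 0.9], [0.8, 0.2]])
encoded_labels = np.array([0, 1])                      # column order of predictions

# argmax gives the position of the highest probability in each row;
# indexing encoded_labels maps that position back to the encoded class.
data["predicted"] = encoded_labels[predictions.argmax(axis=1)]

# True where the most probable class matches the true label.
data["correct"] = data["predicted"] == data["offence"]
```

Calling data["correct"].mean() on the result would then give the share of rows predicted correctly.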

CodePudding user response:

Maybe I misunderstood your problem, but if not, this could be a solution:

  • Create a DataFrame with two columns: prob_Assault and prob_Robbery.

    import pandas as pd
    predictions_df = pd.DataFrame(predictions, columns=['prob_Assault', 'prob_Robbery'])

  • Join that predictions_df to your data
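
The two steps above could look like the sketch below. It assumes data is a pandas DataFrame with one row per prediction. Passing index=data.index is a choice made here so that the join lines up row by row even when data does not have a default 0..n-1 index. The column names are built from label_dict and encoded_labels rather than typed by hand, and the temperature column is a hypothetical stand-in for the real weather data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): note the non-default index, which test data
# often has after a train/test split.
data = pd.DataFrame({"temperature": [12.0, 3.5, 21.0]}, index=[10, 11, 12])
label_dict = {"Assault": 0, "Robbery": 1}
encoded_labels = [0, 1]
predictions = np.array([[0.4, 0.6], [0.1, 0.9], [0.8, 0.2]])

# Invert the encoding so each encoded class maps back to its name.
names = {code: name for name, code in label_dict.items()}
columns = [f"prob_{names[code]}" for code in encoded_labels]

# Reuse data's index so that join() aligns the rows correctly.
predictions_df = pd.DataFrame(predictions, columns=columns, index=data.index)
data = data.join(predictions_df)
```

Without the shared index, join would match on index labels and could leave the new columns filled with NaN.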
