Merging two output dataframes


For my first dataframe I predicted the Decision using XGBoost:

    itemId     property_1      property_2     property_n       Decision
 0      i1          88.90             NaN              0           Good
 1      i2          87.09    7.653800e+06              0            Bad
 2      i3          78.90    7.623800e+06              1           Good
 3      i4          93.02             NaN              1            Bad

import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder 
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

cols_to_drop = ['itemId'] # drop the non-feature columns
df.drop(cols_to_drop, axis=1, inplace=True)

X = df.drop('Decision', axis=1)  # feature columns
y = df['Decision'] # variable that we need to predict

# Split first, then fit the model on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

For the second one I predicted, again, the Decision (note that there might be items here that didn't appear in the first dataframe):

     userId        itemId      Decision
  0      u1            i1             0
  1      u1            i2             1
  2      u2            i1             1
  3      u2            i3             0
  4      u2            i4             1
  5      u3            i5             0

import numpy as np
from surprise import KNNWithMeans, Dataset, Reader
from surprise.model_selection import train_test_split

reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(df_2[['userId', 'itemId', 'Decision']], reader)
trainset, testset = train_test_split(data, test_size=0.25)
algo = KNNWithMeans()
algo.fit(trainset)  # the algorithm must be fitted before calling test()
test = algo.test(testset)
test = pd.DataFrame(test)
test.drop("details", inplace=True, axis=1)
test.columns = ['userId', 'itemId', 'actual', 'cf_predictions']

The second code block gives me output like this (for the test set only):

     userId        itemId       actual      cf_predictions
  0      u3            i5            0                0.05
  1      u3            i6            1                0.66
  2      u4            i1            1                0.99
  3      u4            i3            0                0.04
  4      u5            i4            1                0.98
  5      u5            i5            0                0.06

I would like to add the predictions from the first algorithm to the above dataframe, i.e. an extra column with the y_pred for each item. The problems:

  1. y_pred can be converted to 0's and 1's, so that is easy.
  2. I have to somehow attach y_pred to X_test so that I can see the predicted result for each item.
  3. The two test sets probably don't overlap much, so merging the two dataframes would leave me with barely any matches.

How can I approach this problem?

CodePudding user response:

Looks like you could use itemId as the key to join your last dataframe with the Decision column from your first dataframe:

import pandas as pd

df = pd.DataFrame({'itemId': ['i0', 'i1', 'i2', 'i3'],
                   'property_1': [88.90, 87.09, 78.90, 93.02],
                   'Decision': ['Good', 'Bad', 'Good', 'Bad']})

test = pd.DataFrame({'userId': ['u3', 'u3', 'u4', 'u4', 'u5', 'u5'],
                     'itemId': ['i5', 'i6', 'i1', 'i3', 'i4', 'i5'],
                     'actual': [0, 1, 1, 0, 1, 0],
                     'cf_predictions': [0.05, 0.66, 0.99, 0.04, 0.98, 0.06]})

predicted = df[['itemId', 'Decision']].set_index('itemId')

test.join(predicted, on='itemId')
    userId  itemId  actual  cf_predictions  Decision
0   u3      i5      0       0.05            NaN
1   u3      i6      1       0.66            NaN
2   u4      i1      1       0.99            Bad
3   u4      i3      0       0.04            Bad
4   u5      i4      1       0.98            NaN
5   u5      i5      0       0.06            NaN
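
If you want the model's predictions rather than the original Decision labels, one option is to keep itemId as the index instead of dropping it. train_test_split preserves the index, so X_test.index should line up row by row with y_pred. Below is a minimal sketch under some assumptions: the data is made up, a scikit-learn DecisionTreeClassifier stands in for XGBClassifier (both expose fit/predict) so the snippet runs without xgboost, and the names item_preds and xgb_prediction are only illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in for XGBClassifier (assumption)

# Toy item data, made up for illustration
df = pd.DataFrame({
    'itemId': ['i1', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7', 'i8'],
    'property_1': [88.90, 87.09, 78.90, 93.02, 85.00, 70.10, 91.30, 66.60],
    'property_n': [0, 0, 1, 1, 0, 1, 0, 1],
    'Decision': ['Good', 'Bad', 'Good', 'Bad', 'Good', 'Bad', 'Good', 'Bad'],
})

# Keep itemId as the index instead of dropping the column
df = df.set_index('itemId')
X = df.drop('Decision', axis=1)
y = (df['Decision'] == 'Good').astype(int)  # Good -> 1, Bad -> 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# X_test.index still holds the itemIds, in the same order as y_pred
item_preds = pd.DataFrame({'itemId': X_test.index,
                           'xgb_prediction': y_pred})

# Collaborative-filtering output, same shape as in the question
test = pd.DataFrame({'userId': ['u3', 'u3', 'u4', 'u4', 'u5', 'u5'],
                     'itemId': ['i5', 'i6', 'i1', 'i3', 'i4', 'i5'],
                     'actual': [0, 1, 1, 0, 1, 0],
                     'cf_predictions': [0.05, 0.66, 0.99, 0.04, 0.98, 0.06]})

# A left merge keeps every row of `test`; items without a prediction get NaN
merged = test.merge(item_preds, on='itemId', how='left')
print(merged)
```

Regarding the overlap problem: with how='left' you keep all rows of the second dataframe, and unmatched items simply show NaN. If coverage matters more than a clean hold-out, you could also call model.predict(X) on all items, keeping in mind that predictions for rows the model was trained on will look better than they really are.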
