For my first dataframe I predicted the Decision
using XGBoost:
itemId property_1 property_2 property_n Decision
0 i1 88.90 NaN 0 Good
1 i2 87.09 7.653800e+06 0 Bad
2 i3 78.90 7.623800e+06 1 Good
3 i4 93.02 NaN 1 Bad
...
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
cols_to_drop = ['itemId']  # drop the non-feature columns
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Decision', axis=1)  # feature columns
y = df['Decision']  # variable that we need to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
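One detail worth noting here: train_test_split keeps the original dataframe index on X_test, so the predictions can be lined up with the rows (and hence items) they came from. A minimal sketch, with placeholder values standing in for the real X_test and y_pred:

```python
import pandas as pd

# X_test keeps the original dataframe's index after train_test_split,
# so a prediction frame built on that same index lines up row-for-row
X_test = pd.DataFrame({'property_1': [88.90, 93.02]}, index=[0, 3])
y_pred = ['Good', 'Bad']  # placeholder for model.predict(X_test)

pred_df = pd.DataFrame({'xgb_pred': y_pred}, index=X_test.index)
```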
For the second one I predicted, again, the Decision
(note that there might be items here that didn't appear in the first dataframe):
userId itemId Decision
0 u1 i1 0
1 u1 i2 1
2 u2 i1 1
3 u2 i3 0
4 u2 i4 1
5 u3 i5 0
...
import pandas as pd  # needed below to wrap the predictions in a DataFrame
from surprise import KNNWithMeans, Dataset, Reader
from surprise.model_selection import train_test_split
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(df_2[['userId', 'itemId', 'Decision']], reader)
trainset, testset = train_test_split(data, test_size=0.25)
algo = KNNWithMeans()
algo.fit(trainset)
test = algo.test(testset)
test = pd.DataFrame(test)
test.drop("details", inplace=True, axis=1)
test.columns = ['userId', 'itemId', 'actual', 'cf_predictions']
Since the output of the second code block gives me something like (for the test set!):
userId itemId actual cf_predictions
0 u3 i5 0 0.05
1 u3 i6 1 0.66
2 u4 i1 1 0.99
3 u4 i3 0 0.04
4 u5 i4 1 0.98
5 u5 i5 0 0.06
...
I would like to add the predictions from the first algorithm to the above dataframe, i.e. an extra column with the y_pred for each item. The problems:
- y_pred can be converted to 0's and 1's, so that is easy.
- I have to somehow add y_pred to the X_test set so that I can see the result for each item.
- The test sets probably don't match much, so when merging the two dataframes I would barely get anything.
How can I approach this problem?
CodePudding user response:
Looks like you could use itemId as the key to join your last dataframe with the Decision column from your first dataframe:
import pandas as pd
df = pd.DataFrame({'itemId': ['i1', 'i2', 'i3', 'i4'],
                   'property_1': [88.90, 87.09, 78.90, 93.02],
                   'Decision': ['Good', 'Bad', 'Good', 'Bad']})
test = pd.DataFrame({'userId': ['u3', 'u3', 'u4', 'u4', 'u5', 'u5'],
                     'itemId': ['i5', 'i6', 'i1', 'i3', 'i4', 'i5'],
                     'actual': [0, 1, 1, 0, 1, 0],
                     'cf_predictions': [0.05, 0.66, 0.99, 0.04, 0.98, 0.06]})
predicted = df[['itemId', 'Decision']].set_index('itemId')
test.join(predicted, on='itemId')
  userId itemId  actual  cf_predictions Decision
0     u3     i5       0            0.05      NaN
1     u3     i6       1            0.66      NaN
2     u4     i1       1            0.99     Good
3     u4     i3       0            0.04     Good
4     u5     i4       1            0.98      Bad
5     u5     i5       0            0.06      NaN
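The same idea works with the XGBoost predictions instead of the ground-truth Decision column, which is closer to what you asked for: build a small lookup keyed by itemId from y_pred and merge it in with how='left', so every test row is kept even when the item never appeared in the first dataframe. A sketch with hypothetical prediction values:

```python
import pandas as pd

# hypothetical XGBoost predictions keyed by itemId; in practice the itemIds
# come from X_test's rows and the values from model.predict(X_test)
xgb_preds = pd.DataFrame({'itemId': ['i1', 'i3', 'i4'],
                          'xgb_pred': [1, 0, 1]})
test = pd.DataFrame({'userId': ['u3', 'u4', 'u4', 'u5'],
                     'itemId': ['i5', 'i1', 'i3', 'i4'],
                     'cf_predictions': [0.05, 0.99, 0.04, 0.98]})

# left merge keeps all test rows; items unseen by XGBoost get NaN
merged = test.merge(xgb_preds, on='itemId', how='left')
```

Rows with NaN in xgb_pred are exactly the items the first model never saw, so the sparse overlap you worried about becomes visible rather than silently dropped.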