I am building a system that recommends a book from a dataset based on what suits the user best. The problem is that instead of a single book, I get many of them back. How can I fix this? This is the code:
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

class SuggestAudiobook:
    def __init__(self, book):
        model = KNeighborsClassifier()
        book = pd.read_csv("dataset.csv", delimiter = ";")
        var2 = book.Title
        var1 = book[["audioRuntime_converted", "category_converted"]]
        var2 = var2.astype('string')
        var1 = var1.astype('int')
        model.fit(var1, var2)
        dataframe = pd.DataFrame(data = {"audioRuntime_converted": book.audioRuntime_converted, "category_converted": book.category_converted})
        predictionDataframe = model.predict(dataframe)
        print("L'audiobook recommended for you is --> ", predictionDataframe)
The result is this:
L'audiobook recommended for you is -->  ['Catching Fire' 'In Charge of Moonlight' 'Catching Fire' ... 'Born a Crime' 'Born a Crime' 'Born a Crime']
I attach the images of the result obtained:
I want to recommend one book from those in the dataset based on the input data. In this case the inputs are audioRuntime_converted and category_converted (they come from the other file that calls the function). I then search the dataset using those two fields. I am confident the procedure is correct because I applied it in another project; the only problem is that the output contains many values instead of one.
CodePudding user response:
Your dataframe has multiple rows, and the .predict() function runs once for every row you pass to it, so len(predictionDataframe) == len(dataframe).
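To make that concrete, here is a minimal sketch. The toy rows below are made-up stand-ins for dataset.csv, and the names features, all_rows and one_row are only illustrative. It shows that predicting on the whole table returns one label per row, while a one-row input returns a single label:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for dataset.csv (hypothetical values, not the real data).
book = pd.DataFrame({
    "audioRuntime_converted": [1, 2, 3, 10, 11, 12],
    "category_converted": [1, 1, 2, 8, 9, 9],
    "Title": ["Book A", "Book A", "Book A", "Book B", "Book B", "Book B"],
})
features = ["audioRuntime_converted", "category_converted"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(book[features], book["Title"])

# Predicting on the whole table gives one label per row.
all_rows = model.predict(book[features])
print(len(all_rows), len(book))

# Predicting on a single-row DataFrame gives a single label.
one_row = pd.DataFrame([[2, 1]], columns=features)
single = model.predict(one_row)
print(single)
```

So the fix is to pass only the user's own values to .predict(), not the dataset the model was trained on.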
CodePudding user response:
The prediction is made for each record in the input passed to model.predict(input). In your code you seem to have passed the training dataset itself, so the output is also a list of books, with as many entries as there are rows in the training labels (var2).
I have simulated a simple (quite obvious) dataset for the prediction:
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
# book = pd.read_csv("dataset.csv", delimiter = ";")
df1 = pd.concat([pd.DataFrame(np.random.uniform(0, 10, (5,2))), pd.DataFrame(['Book A']*5)], axis=1)
df2 = pd.concat([pd.DataFrame(np.random.uniform(5, 15, (5,2))), pd.DataFrame(['Book B']*5)], axis=1)
book = pd.concat([df1, df2])
book.columns = ['audioRuntime_converted', 'category_converted', 'Title']
print(book)
   audioRuntime_converted  category_converted   Title
0                3.180352            1.995319  Book A
1                5.928537            9.304618  Book A
2                3.445036            5.746906  Book A
3                3.623655            2.043251  Book A
4                8.340740            9.641824  Book A
0                7.224949            7.158453  Book B
1                9.191920           10.732677  Book B
2                7.417375            6.956461  Book B
3               10.274473           14.435836  Book B
4                5.945386           13.222845  Book B
Next I do the training and prediction:
var1 = book[["audioRuntime_converted", "category_converted"]].astype('int').values  # this is X_train
var2 = book.Title.astype('string')  # this is y_train
model = KNeighborsClassifier()
model.fit(var1, var2)
test_list = [[1, 3], [3, 6], [9, 7], [10, 12]]  # list of user attributes [x, y]
for user in test_list:
    prediction = model.predict([user])  # input 1 user to get 1 book recommendation
    print(f"L'audiobook recommended for user {user} is --> {prediction}")
Output:
L'audiobook recommended for user [1, 3] is --> ['Book A']
L'audiobook recommended for user [3, 6] is --> ['Book A']
L'audiobook recommended for user [9, 7] is --> ['Book B']
L'audiobook recommended for user [10, 12] is --> ['Book B']
As you can see, if a user has low attributes [x,y], the recommended book is "Book A", whereas if a user has higher attributes [x,y], the recommended book is "Book B".
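If you want to see why a given user gets a given book, you can inspect the neighbours behind the vote. This is only a sketch on made-up points (X, y and the choice of n_neighbors=3 are my assumptions, not your data). It uses the classifier's kneighbors() method to list the training rows closest to the user:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: low values belong to Book A, high values to Book B.
X = np.array([[1, 2], [2, 3], [3, 1], [9, 10], [10, 12], [11, 9]])
y = np.array(["Book A", "Book A", "Book A", "Book B", "Book B", "Book B"])

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

user = [[1, 3]]  # one user, wrapped in a list so the input is 2-D

# kneighbors returns the distances to, and the row indices of, the nearest training points.
distances, indices = model.kneighbors(user)
neighbor_titles = y[indices[0]]
print(neighbor_titles)

# predict returns the majority title among those neighbours.
prediction = model.predict(user)[0]
print(prediction)
```

All three nearest points carry the same title here, so the majority vote is unambiguous.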
Also, passing a single user attribute pair to model.predict(input) (for example [1, 3], wrapped in a list as [[1, 3]]) returns exactly one book recommendation.