Predict() returns too many values instead of one

I am building a system that recommends one book from a dataset based on what suits the user best. The problem is that predict() does not return a single book; it returns many of them. How can I fix this?

Here is the code:

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

class SuggestAudiobook:
    def __init__(self, book):

        model = KNeighborsClassifier()

        book = pd.read_csv("dataset.csv", delimiter=";")

        var2 = book.Title

        var1 = book[["audioRuntime_converted", "category_converted"]]

        var2 = var2.astype('string')
        var1 = var1.astype('int')

        model.fit(var1, var2)

        dataframe = pd.DataFrame(data={"audioRuntime_converted": book.audioRuntime_converted, "category_converted": book.category_converted})

        predictionDataframe = model.predict(dataframe)

        print("L'audiobook recommended for you is --> ", predictionDataframe)

The result is this:

L'audiobook recommended for you is -->  ['Catching Fire' 'In Charge of Moonlight' 'Catching Fire' ... 'Born a Crime' 'Born a Crime' 'Born a Crime']

I want to recommend one book from those in the dataset based on the input data. In this case the inputs are audioRuntime_converted and category_converted (they come from the other file that calls the function), and I search the dataset using those two fields. I am confident the procedure is correct because I applied it in another project; the only problem is that the output contains many values instead of one.

CodePudding user response：

Your dataframe has multiple rows, and the .predict() function returns one prediction for every row you pass in.

So len(predictionDataframe) == len(dataframe)
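
To make that concrete, here is a minimal sketch using a made-up four-row dataset (the values and titles are illustrative assumptions, not taken from the original dataset.csv). It uses n_neighbors=1 only to keep the toy example unambiguous:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data only: two features and a title per row.
train = pd.DataFrame({
    "audioRuntime_converted": [1, 2, 9, 10],
    "category_converted": [1, 2, 9, 10],
    "Title": ["Book A", "Book A", "Book B", "Book B"],
})
features = train[["audioRuntime_converted", "category_converted"]]

# n_neighbors=1 is an assumption made for this tiny dataset.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(features, train["Title"])

# Predicting on the whole table returns one label per row.
all_predictions = model.predict(features)
print(len(all_predictions) == len(features))

# Predicting on a one-row table returns exactly one label.
one_row = features.iloc[[0]]
single_prediction = model.predict(one_row)
print(len(single_prediction))
```

So the number of predictions is decided entirely by how many rows you hand to predict(), not by the model.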

CodePudding user response：

model.predict(input) makes one prediction for each record in the input. In your code, you pass the training dataset itself to predict(), so the output is a list of books with the same number of entries as the training labels (var2).

I have simulated a small (deliberately obvious) dataset to demonstrate the prediction:

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

# book = pd.read_csv("dataset.csv", delimiter = ";")
df1 = pd.concat([pd.DataFrame(np.random.uniform(0, 10, (5,2))), pd.DataFrame(['Book A']*5)], axis=1)
df2 = pd.concat([pd.DataFrame(np.random.uniform(5, 15, (5,2))), pd.DataFrame(['Book B']*5)], axis=1)
book = pd.concat([df1, df2])
book.columns = ['audioRuntime_converted', 'category_converted', 'Title']
print(book)

   audioRuntime_converted  category_converted   Title
0                3.180352            1.995319  Book A
1                5.928537            9.304618  Book A
2                3.445036            5.746906  Book A
3                3.623655            2.043251  Book A
4                8.340740            9.641824  Book A
0                7.224949            7.158453  Book B
1                9.191920           10.732677  Book B
2                7.417375            6.956461  Book B
3               10.274473           14.435836  Book B
4                5.945386           13.222845  Book B

Next I do the training and prediction:

var1 = book[["audioRuntime_converted", "category_converted"]].astype('int').values    #this is X_train
var2 = book.Title.astype('string')                                                    #this is y_train
model = KNeighborsClassifier()
model.fit(var1, var2)

test_list = [ [1,3], [3,6], [9,7], [10,12] ]    #list of user attributes [x,y]
for user in test_list:
    prediction = model.predict([user])    #input 1 user to get 1 book recommendation
    print(f"L'audiobook recommended for user {user} is --> {prediction}")

Output:

L'audiobook recommended for user [1, 3] is --> ['Book A']
L'audiobook recommended for user [3, 6] is --> ['Book A']
L'audiobook recommended for user [9, 7] is --> ['Book B']
L'audiobook recommended for user [10, 12] is --> ['Book B']

As you can see, if a user has low attribute values [x,y], the recommended book is "Book A", whereas if a user has higher attribute values [x,y], the recommended book is "Book B".

Note also that passing a single user attribute pair to model.predict() (for example [[1, 3]]) returns exactly one book recommendation.
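
Applied to the class from the question, this could look roughly like the sketch below. The suggest() method, the FEATURES constant, the in-memory books table and n_neighbors=1 are all assumptions made for illustration; in the real project the table would come from pd.read_csv("dataset.csv", delimiter=";") and the two values from the file that calls the function.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


class SuggestAudiobook:
    FEATURES = ["audioRuntime_converted", "category_converted"]

    def __init__(self, books):
        # books: DataFrame with the two feature columns plus "Title".
        # n_neighbors=1 (an assumption) returns the title of the closest book.
        self.model = KNeighborsClassifier(n_neighbors=1)
        self.model.fit(books[self.FEATURES].astype(int), books["Title"].astype(str))

    def suggest(self, runtime, category):
        # Build a one-row frame from the user's two values, so that
        # predict() returns a single label.
        user = pd.DataFrame([[runtime, category]], columns=self.FEATURES)
        return self.model.predict(user)[0]


# Made-up stand-in for the real dataset.
books = pd.DataFrame({
    "audioRuntime_converted": [1, 2, 9, 10],
    "category_converted": [1, 2, 9, 10],
    "Title": ["Book A", "Book A", "Book B", "Book B"],
})
recommender = SuggestAudiobook(books)
title = recommender.suggest(runtime=9, category=8)
print("L'audiobook recommended for you is -->", title)
```

Training happens once in __init__, and each call to suggest() passes exactly one row to predict(), so exactly one title comes back.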

Edit: Here is a comparison between the code in your question and your other code:

# Code in this question: each value is already a Series (a whole column)
pd.DataFrame(data={"audioRuntime_converted": book.audioRuntime_converted, "category_converted": book.category_converted})
# which is effectively:
# pd.DataFrame(data={"audioRuntime_converted": [this, is, already, a, series], "category_converted": [this, is, also, a, series]})
# that's why the output is a series of predictions

# Your other code: each value is a single number wrapped in a list
pd.DataFrame(data={"audioRuntime_converted": [book.audioRuntime_converted], "average_rating_converted": [book.average_rating_converted], "ratings_count_converted": [book.ratings_count_converted]})
# which is effectively:
# pd.DataFrame(data={"audioRuntime_converted": [a single number], "average_rating_converted": [a single number], "ratings_count_converted": [a single number]})
# that's why there is only 1 prediction
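
You can check the difference yourself by looking at the shape of each frame. This is a small sketch with a made-up three-row table; the numbers are purely illustrative:

```python
import pandas as pd

# Illustrative table standing in for the dataset.
book = pd.DataFrame({
    "audioRuntime_converted": [3, 5, 8],
    "category_converted": [1, 2, 2],
})

# Each value is a whole column (a Series), so the frame has one row per book.
many_rows = pd.DataFrame(data={
    "audioRuntime_converted": book.audioRuntime_converted,
    "category_converted": book.category_converted,
})

# Each value is a one-element list holding a single number,
# so the frame has exactly one row.
one_row = pd.DataFrame(data={
    "audioRuntime_converted": [5],
    "category_converted": [2],
})

print(many_rows.shape)
print(one_row.shape)
```

The first frame has as many rows as the dataset and the second has one, which is why predict() gives many results in one case and a single result in the other.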