What format should the Y in sklearn be?

Does Y have to be one-hot encoded or not? For example in this code:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, Y)

CodePudding user response：

Y should be a single column of your DataFrame.
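As a minimal sketch of what "a single column" might look like in practice, assuming a small made-up DataFrame (the column names `height`, `width` and `label` are hypothetical, not from the question):

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "height": [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],
    "width": [0.5, 0.6, 0.4, 1.5, 1.4, 1.6],
    "label": ["a", "a", "a", "b", "b", "b"],
})

X = df[["height", "width"]]  # 2-D: a DataFrame holding the feature columns
Y = df["label"]              # 1-D: one column (a Series) holding the class labels

clf = GaussianNB()
clf.fit(X, Y)
```

Selecting with double brackets keeps X two-dimensional, while single brackets return the one-dimensional Series that fit expects for the target.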

CodePudding user response：

No, y should usually not be one-hot encoded.

In general, one-hot encoding or dummy encoding is used to encode features (columns in X), not the target y.
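To illustrate the distinction, here is a rough sketch that one-hot encodes a categorical feature while leaving the target as plain labels. The data and the variable names (`colour`, `size`) are assumptions made up for this example:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: one categorical feature and one numeric feature.
colour = np.array([["red"], ["red"], ["blue"], ["blue"], ["green"], ["green"]])
size = np.array([[1.0], [1.2], [3.1], [2.9], [5.0], [5.2]])

# One-hot encode the categorical *feature*; toarray() turns the sparse result into a dense array.
colour_encoded = OneHotEncoder().fit_transform(colour).toarray()
X = np.hstack([colour_encoded, size])  # 3 one-hot columns + 1 numeric column

# The target stays a 1-D array of class labels (strings are fine).
y = np.array(["small", "small", "medium", "medium", "large", "large"])

clf = GaussianNB()
clf.fit(X, y)
```

The classifier works out the set of classes from the labels in y on its own, so no manual encoding of the target is needed here.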

In machine learning tasks, y is very often a single column from your data, represented as a 1d array, a list, or a Pandas Series. Notice the example in the docs:

import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, Y)
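Continuing the same example as a short sketch, the fitted classifier exposes the labels it found in Y and returns predictions in that same label format:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB()
clf.fit(X, Y)

# The distinct class labels discovered in Y.
print(clf.classes_)  # [1 2]

# predict returns a 1-D array with one label per input row.
predicted = clf.predict([[-0.8, -1]])
print(predicted)  # [1]
```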

y is usually written in lower case (though not always, as the example above shows). Conventional mathematical notation uses upper-case letters such as X for matrices, which are typically 2d arrays in NumPy or DataFrames in Pandas, and lower-case letters for vectors, which are 1d arrays or Pandas Series.
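If a target has already ended up as a 2d array, it can usually be collapsed back into a 1d vector before fitting. The following is a sketch under the assumption that each one-hot row contains exactly one 1; the arrays are made up for illustration:

```python
import numpy as np

# Hypothetical one-hot encoded target: 6 samples, 2 classes (a 2-D matrix).
Y_onehot = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

# argmax along each row gives the column index of the 1, i.e. a 1-D vector of class indices.
y = Y_onehot.argmax(axis=1)
print(y.shape)  # (6,)

# A column vector of shape (n_samples, 1) can be flattened to 1-D with ravel().
y_column = np.array([[1], [1], [1], [2], [2], [2]])
y_flat = y_column.ravel()
print(y_flat.shape)  # (6,)
```

Note that argmax yields class indices (0, 1, ...) rather than the original label names, so a separate mapping is needed if the names matter.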

CodePudding user response：

Did you check the documentation? The fit method expects y to be array-like of shape (n_samples,): https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html?highlight=gaussiannb#sklearn.naive_bayes.GaussianNB.fit