Is sklearn.naive_bayes.CategoricalNB the same as sklearn.naive_bayes.BernoulliNB, but with one-hot encoding in the columns?
I couldn't quite tell from the documentation, and CategoricalNB has that one extra parameter, alpha, whose purpose I don't understand.
CodePudding user response:
The categorical distribution is the Bernoulli distribution, generalized to more than two categories. Stated another way, the Bernoulli distribution is a special case of the categorical distribution, with exactly 2 categories.
In the Bernoulli model, each feature is assumed to have exactly 2 categories, often denoted as 1 and 0 or True and False. In the categorical model, each feature is assumed to have at least 2 categories, and each feature may have a different total number of categories.
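To make the "different number of categories per feature" point concrete, here is a minimal sketch. The toy data and variable names are made up for illustration, and it assumes a scikit-learn version recent enough to expose the n_categories_ attribute on a fitted CategoricalNB.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: a 3-category feature (color) and a 2-category feature (size).
X_raw = np.array([
    ["red", "small"],
    ["green", "large"],
    ["blue", "small"],
    ["red", "large"],
])
y = np.array([0, 1, 1, 0])

# Map each feature's categories to integer codes 0, ..., n - 1.
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw).astype(int)

model = CategoricalNB().fit(X, y)

# Number of categories the model found for each feature.
n_categories = model.n_categories_.tolist()
print(n_categories)
```

Each feature keeps its own category count, which a Bernoulli model cannot express because every one of its features is binary.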
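One-hot encoding is not what separates the two models. It is a technique for encoding a categorical variable as columns of a numerical matrix, and it has no bearing on the distribution used to model that variable, although it is natural to model categorical variables using the categorical distribution. In particular, fitting BernoulliNB to one-hot columns is not equivalent to CategoricalNB: BernoulliNB treats each indicator column as its own independent binary feature, so every 0 column also contributes a factor to the likelihood. CategoricalNB does not expect one-hot input at all; according to the scikit-learn documentation, each feature should be given as integer codes 0, ..., n - 1, for instance from OrdinalEncoder.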
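A quick way to check whether BernoulliNB on one-hot columns reproduces CategoricalNB is to fit both on the same toy data and compare the predicted probabilities. This is only a sketch: the data is made up, and the variable names are mine.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one feature with 3 categories, already coded 0..2.
X_codes = np.array([[0], [0], [1], [2], [1], [2], [2], [0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# CategoricalNB consumes the integer codes directly.
cat_nb = CategoricalNB(alpha=1.0).fit(X_codes, y)

# BernoulliNB sees three separate 0/1 columns after one-hot encoding.
X_onehot = OneHotEncoder().fit_transform(X_codes).toarray()
bern_nb = BernoulliNB(alpha=1.0).fit(X_onehot, y)

cat_proba = cat_nb.predict_proba(X_codes)
bern_proba = bern_nb.predict_proba(X_onehot)

# True only if both models assign the same class probabilities everywhere.
models_agree = bool(np.allclose(cat_proba, bern_proba))
print(models_agree)
print(cat_proba[0], bern_proba[0])
```

On this data the two models disagree: for the first sample, CategoricalNB gives class 0 a probability of 0.6, while BernoulliNB gives about 0.667, because the two "absent" indicator columns also enter its likelihood.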
The "alpha" parameter is the additive (Laplace) smoothing parameter, and it is not unique to CategoricalNB: BernoulliNB accepts it as well. I will not go into detail about it here, because that is better suited for CrossValidated, e.g. https://stats.stackexchange.com/q/192233/36229. From a computational perspective, it exists to prevent zero probability estimates from "poisoning" the calculation: naive Bayes multiplies per-feature probabilities together, so a single 0 wipes out the whole product. This is a practical concern that arises whenever some combination of class label and feature category is not present in your data set. It's fine to leave it at the default value of 1.
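The effect of alpha can be sketched in plain Python, without scikit-learn. The helper name smoothed_probs and all the counts below are hypothetical; the formula, (count + alpha) / (total + alpha * n_categories), is the additive smoothing estimate as I understand it from the CategoricalNB documentation.

```python
import math

def smoothed_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * n_categories)."""
    total = sum(counts)
    n_categories = len(counts)
    return [(c + alpha) / (total + alpha * n_categories) for c in counts]

# Hypothetical counts of a 3-category feature within one class;
# the third category never co-occurs with this class in the training data.
counts = [5, 3, 0]

unsmoothed = smoothed_probs(counts, alpha=0.0)
smoothed = smoothed_probs(counts, alpha=1.0)

# Naive Bayes multiplies per-feature likelihoods, so one zero factor
# zeroes the whole product regardless of the other features.
other_feature_likelihoods = [0.9, 0.8]
score_unsmoothed = math.prod([unsmoothed[2], *other_feature_likelihoods])
score_smoothed = math.prod([smoothed[2], *other_feature_likelihoods])

print(score_unsmoothed, score_smoothed)
```

With alpha=0 the unseen category gets probability 0 and the class score collapses to 0; with alpha=1 it gets 1/11 and the other features still count.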