I have a balanced two-class dataset used for model training. My model has a precision of 50%, meaning that out of 100 samples it predicts 50 as positive, and of those 50 only 25 are actually positive. The model is basically as good as flipping a coin.
Now in production the data is highly imbalanced, say only 4 out of 100 samples are positive. Will my model still have the same precision?
The way I understand it, my coin-flip model would still label about 50 samples as positive. Of the 4 actual positives, roughly half (2) would end up in that group, so precision would drop to 2/50 = 4% in production.
Is it true that a model trained on a balanced dataset can have a different precision in production?
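A quick sanity check of my arithmetic (a minimal sketch with a made-up helper name, assuming the model's 50% sensitivity and 50% specificity carry over unchanged from training to production):

```python
def expected_precision(prevalence, n=100, sensitivity=0.5, specificity=0.5):
    """Expected precision of a fixed classifier at a given positive prevalence."""
    positives = n * prevalence
    negatives = n - positives
    tp = sensitivity * positives          # positives correctly flagged
    fp = (1 - specificity) * negatives    # negatives wrongly flagged
    return tp / (tp + fp)

print(expected_precision(0.50))  # balanced training set     -> 0.50 (25/50)
print(expected_precision(0.04))  # production, 4% positives  -> 0.04 (2/50)
```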
CodePudding user response:
That depends: of those 50 samples classified as positive, are the 25 true positives all the positives your model can find? Precision in production is not a fixed property of the model; it follows from the model's sensitivity (recall) and specificity combined with the class prevalence. Your numbers (50% sensitivity, 50% specificity) at 4% prevalence give exactly the 2/(2 + 48) = 4% you computed. If instead your model correctly predicted every positive sample while still flagging half of the negative ones as positive (high sensitivity, low specificity), precision would be around 4/(4 + 48) ≈ 8%. Nevertheless, you should revisit your training, since for 50% precision you don't need an ML model but rather a one-liner generating a random variable between 0 and 1.
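As a rough sketch of why the number moves (the scenario values here are hypothetical, not measured): precision can be written as sensitivity·prevalence / (sensitivity·prevalence + (1 − specificity)·(1 − prevalence)), and a quick simulation also shows that the coin-flip baseline's precision simply collapses to the prevalence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Analytic precision from sensitivity, specificity, and class prevalence.
def precision_from_rates(sens, spec, prev):
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

print(precision_from_rates(0.5, 0.5, 0.04))  # your coin-flip case   -> 0.04
print(precision_from_rates(1.0, 0.5, 0.04))  # all positives caught  -> ~0.077 (~8%)

# The "one-liner" baseline: predictions independent of the input.
n = 1_000_000
y_true = rng.random(n) < 0.04   # production labels, 4% positive
y_pred = rng.random(n) < 0.5    # coin flip: random variable in [0, 1)
tp = np.sum(y_pred & y_true)
fp = np.sum(y_pred & ~y_true)
print(tp / (tp + fp))           # ~0.04: precision equals the prevalence
```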