How is the performance of a machine learning model evaluated in the field?


Consider use cases like

  1. lending money - an ML model predicts that it is safe to lend money to an individual.
  2. predictive maintenance, in which a machine learning model predicts that a piece of equipment will not fail.

In the above cases, it is easy to tell whether the ML model's prediction was correct, based on whether the money was paid back and whether the equipment failed.

How is the performance of a model evaluated in the following scenarios? Am I correct that it is not possible to evaluate performance in these cases?

  1. lending money - the ML model predicts that lending money to an individual is NOT safe, and the money is not lent.
  2. predictive maintenance, in which a machine learning model predicts that a piece of equipment will fail, and the equipment is therefore replaced.

In general, would I be correct in saying that some predictions can be evaluated but others can't be? For scenarios where performance can't be evaluated, how do businesses ensure they are not losing opportunities due to incorrect predictions? I am guessing there is no way to do this, since this problem also exists without the use of ML models. I am just putting my question here to validate my thought process.

CodePudding user response:

If you think about it, both groups refer to the same models, just different use cases. If you take the model that predicts whether it's safe to lend money and invert its prediction, you get a prediction of whether it's NOT safe to lend money.

And if you use your model to predict safe lending, you would still care about increasing recall (i.e., reducing the number of safe cases that are classified as unsafe).
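Here is a minimal sketch, using scikit-learn and made-up labels, of that point: a "safe to lend" model and its inverted "NOT safe to lend" model are the same classifier viewed from two sides, and recall of the "safe" class measures how many truly safe applicants we avoid wrongly rejecting.

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # 1 = loan was actually safe (repaid)
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]   # 1 = model predicted "safe to lend"

# Recall of the "safe" class: fraction of truly safe applicants
# that the model did NOT misclassify as unsafe.
print(recall_score(y_true, y_pred, pos_label=1))  # 0.8

# Inverting labels and predictions gives the "NOT safe" model;
# its recall for the "unsafe" class is just the complementary
# view of the very same classifier.
y_true_inv = [1 - y for y in y_true]
y_pred_inv = [1 - y for y in y_pred]
print(recall_score(y_true_inv, y_pred_inv, pos_label=1))  # ~0.67
```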

Some predictions can't be evaluated if we act on them (if we denied lending, we can't tell whether the model was right). A related problem is gathering a good dataset to train the model further: we usually train on the data we observed, so if we deny 90% of applications based on the current model's predictions, we can only train the next model on the remaining 10% of applications.
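A rough illustration of this feedback problem, on synthetic data (the risk model and the 10% approval threshold are made up for the example): the slice of applications the model approves looks much safer than the population, so a model retrained only on it sees a very different world from the one it will be used in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
risk = rng.uniform(0, 1, n)                  # true (unobserved) default risk
default = rng.uniform(0, 1, n) < risk        # whether the loan would default

model_score = risk + rng.normal(0, 0.2, n)   # imperfect model estimate of risk
approved = model_score < np.quantile(model_score, 0.10)  # approve "best" 10%

print("default rate, all applicants:     ", default.mean())
print("default rate, approved (observed):", default[approved].mean())
```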

However, there are some ways to work around this:

  1. Turning the model off for some percentage of applications. Let's say a random 1% of applications are approved regardless of the model's prediction. This gives us an unbiased dataset to evaluate the model on (see the sketch after this list).
  2. Using historical data that was gathered before the model was introduced.
  3. Finding a proxy metric that correlates with the business metric but is easier to evaluate. For example, you could measure the percentage of applicants who made late payments (with other lenders, not with us) within one year of their application, among the applicants approved vs. rejected by our model. The larger the difference in this metric between the rejected and approved groups, the better our model performs. For this to work, though, you have to show that this metric correlates with the probability that our lending is unsafe.
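A sketch of the first workaround, assuming a hypothetical `score_application()` model and a stream of application records: a random 1% "exploration" slice is approved regardless of the model, and only that slice is later used to evaluate the model, since those approvals don't depend on its predictions.

```python
import random

EXPLORATION_RATE = 0.01
exploration_log = []   # (application_id, model_said_safe) pairs we can label later

def decide(application_id, features, score_application):
    """Return 'approve' or 'reject', logging the unbiased exploration slice."""
    model_said_safe = score_application(features) >= 0.5
    if random.random() < EXPLORATION_RATE:
        # Approve regardless of the model; record its prediction for evaluation.
        exploration_log.append((application_id, model_said_safe))
        return "approve"
    return "approve" if model_said_safe else "reject"

# Example usage with a dummy scorer (a real model would replace the lambda):
decision = decide("app-123", {"income": 50_000}, lambda f: 0.3)
```

Once repayment outcomes are known for the exploration slice, accuracy, precision, and recall can be computed on it without selection bias, because every application in that slice was approved independently of the model.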