How to find the optimal cut-off point to minimize both the FNR and FPR in R?-CodePudding

I should find the optimal threshold to minimize both the false positive rate and false negative rate. An equal weight between these two rates should be assumed. I write the following code:

data=read.csv( url("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"), sep=",")
library(ROCR)
pred=prediction(data$decile_score/10, data$two_year_recid)
perf=performance(pred, measure="fnr",x.measure="fpr")

opt.cut = function(perf, pred)
{
    cut.ind = mapply(FUN=function(x, y, p){
        d = (x - 0)^2   (y-1)^2
        ind = which(d == min(d))
        c(False_negative_rate = 1-y[[ind]], False_positive_rate = x[[ind]], 
            cutoff = p[[ind]])
    }, [email protected], [email protected], pred@cutoffs)
}

print(opt.cut(perf, pred))

It throws out this result:

                   [,1]
False_negative_rate    0
False_positive_rate    0
cutoff               Inf

However, I think there is something wrong with my code.

CodePudding user response：

Well, I think your code is flawed from a logical point of view. You said You want to

minimize both the false positive rate and false negative rate

But then you minimize

d = (x - 0)^2 (y-1)^2

which is 1 - FNR which is the True Positive Rate.

Thus, assuming you want to minimize FPR and FNR you could simply do:

pred@cutoffs[[1]][which.min(sqrt([email protected][[1]] ^ 2   [email protected][[1]] ^ 2))]

# [1] 0.5

(no need to use extra loops as R is nicely vectorized)

To verify this result, you can simply calculate FPR and FNR yourself for different cutoffs (which will give you the same results as performance of course, but it is a good exercise to understand the principles):

 t(sapply(pred@cutoffs[[1]], function(co) {
   prediction <- factor(ifelse(data$decile_score / 10 < co, 0, 1), 0:1)
   confusion_matrix <- table(data$two_year_recid, prediction)
   fpr <- confusion_matrix[1, 2] / sum(confusion_matrix[1, ])
   fnr <- confusion_matrix[2, 1] / sum(confusion_matrix[2, ])
   c(cutoff = co, fpr = fpr, fnr = fnr, dist = sqrt(fpr ^ 2   fnr ^2))
}))

#       cutoff        fpr        fnr      dist
#  [1,]    Inf 0.00000000 1.00000000 1.0000000
#  [2,]    1.0 0.02195307 0.90895109 0.9092162
#  [3,]    0.9 0.06056018 0.79975392 0.8020436
#  [4,]    0.8 0.10143830 0.69209474 0.6994890
#  [5,]    0.7 0.16250315 0.58443556 0.6066071
#  [6,]    0.6 0.23391370 0.47431560 0.5288581
#  [7,]    0.5 0.32349230 0.37403876 0.4945223 #### <<- Minimum
#  [8,]    0.4 0.43325763 0.27130114 0.5111912
#  [9,]    0.3 0.55084532 0.18486620 0.5810388
# [10,]    0.2 0.71435781 0.09474008 0.7206128
# [11,]    0.1 1.00000000 0.00000000 1.0000000

CodePudding user response：

The first values in [email protected], [email protected], pred@cutoffs are causing your results, they are 1, 0 and Inf, respectively. In order to remove them, loop through each list member and extract the vectors without the 1st element.

library(ROCR)

opt.cut = function(perf, pred) {
  #
  x.values <- lapply([email protected], `[`, -1)
  y.values <- lapply([email protected], `[`, -1)
  cutoffs <- lapply(pred@cutoffs, `[`, -1)
  #
  cut.ind <- mapply(FUN=function(x, y, p){
    d <- x^2   y^2
    ind <- which.min(d)
    c(False_negative_rate = y[[ind]], 
      False_positive_rate = x[[ind]], 
      cutoff = p[[ind]])
  }, x.values, y.values, cutoffs)
  cut.ind
}

pred <- prediction(data$decile_score/10, data$two_year_recid)
perf <- performance(pred, measure = "fnr", x.measure = "fpr")

opt.cut(perf, pred)
#                         [,1]
#False_negative_rate 0.3740388
#False_positive_rate 0.3234923
#cutoff              0.5000000