I should find the optimal threshold to minimize both the false positive rate and false negative rate. An equal weight between these two rates should be assumed. I write the following code:
data=read.csv( url("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"), sep=",")
library(ROCR)
pred=prediction(data$decile_score/10, data$two_year_recid)
perf=performance(pred, measure="fnr",x.measure="fpr")
opt.cut = function(perf, pred)
{
cut.ind = mapply(FUN=function(x, y, p){
d = (x - 0)^2 (y-1)^2
ind = which(d == min(d))
c(False_negative_rate = 1-y[[ind]], False_positive_rate = x[[ind]],
cutoff = p[[ind]])
}, [email protected], [email protected], pred@cutoffs)
}
print(opt.cut(perf, pred))
It throws out this result:
[,1]
False_negative_rate 0
False_positive_rate 0
cutoff Inf
However, I think there is something wrong with my code.
CodePudding user response:
Well, I think your code is flawed from a logical point of view. You said You want to
minimize both the false positive rate and false negative rate
But then you minimize
d = (x - 0)^2 (y-1)^2
which is 1 - FNR
which is the True Positive Rate.
Thus, assuming you want to minimize FPR and FNR you could simply do:
pred@cutoffs[[1]][which.min(sqrt([email protected][[1]] ^ 2 [email protected][[1]] ^ 2))]
# [1] 0.5
(no need to use extra loops as R
is nicely vectorized)
To verify this result, you can simply calculate FPR and FNR yourself for different cutoffs (which will give you the same results as performance
of course, but it is a good exercise to understand the principles):
t(sapply(pred@cutoffs[[1]], function(co) {
prediction <- factor(ifelse(data$decile_score / 10 < co, 0, 1), 0:1)
confusion_matrix <- table(data$two_year_recid, prediction)
fpr <- confusion_matrix[1, 2] / sum(confusion_matrix[1, ])
fnr <- confusion_matrix[2, 1] / sum(confusion_matrix[2, ])
c(cutoff = co, fpr = fpr, fnr = fnr, dist = sqrt(fpr ^ 2 fnr ^2))
}))
# cutoff fpr fnr dist
# [1,] Inf 0.00000000 1.00000000 1.0000000
# [2,] 1.0 0.02195307 0.90895109 0.9092162
# [3,] 0.9 0.06056018 0.79975392 0.8020436
# [4,] 0.8 0.10143830 0.69209474 0.6994890
# [5,] 0.7 0.16250315 0.58443556 0.6066071
# [6,] 0.6 0.23391370 0.47431560 0.5288581
# [7,] 0.5 0.32349230 0.37403876 0.4945223 #### <<- Minimum
# [8,] 0.4 0.43325763 0.27130114 0.5111912
# [9,] 0.3 0.55084532 0.18486620 0.5810388
# [10,] 0.2 0.71435781 0.09474008 0.7206128
# [11,] 0.1 1.00000000 0.00000000 1.0000000
CodePudding user response:
The first values in [email protected]
, [email protected]
, pred@cutoffs
are causing your results, they are 1, 0 and Inf, respectively. In order to remove them, loop
through each list member and extract the vectors without the 1st element.
library(ROCR)
opt.cut = function(perf, pred) {
#
x.values <- lapply([email protected], `[`, -1)
y.values <- lapply([email protected], `[`, -1)
cutoffs <- lapply(pred@cutoffs, `[`, -1)
#
cut.ind <- mapply(FUN=function(x, y, p){
d <- x^2 y^2
ind <- which.min(d)
c(False_negative_rate = y[[ind]],
False_positive_rate = x[[ind]],
cutoff = p[[ind]])
}, x.values, y.values, cutoffs)
cut.ind
}
pred <- prediction(data$decile_score/10, data$two_year_recid)
perf <- performance(pred, measure = "fnr", x.measure = "fpr")
opt.cut(perf, pred)
# [,1]
#False_negative_rate 0.3740388
#False_positive_rate 0.3234923
#cutoff 0.5000000