I am working with the R programming language.
I have the following dataset:
library(dplyr)
var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
id = 1:10000
final = data.frame(id, var1, var2, var3, var4)
Suppose I want to create a new variable based on decile values of another variable (e.g. var3). I can use the following code to do this:
final = final %>%
mutate(class3 = case_when(ntile(var3, 10) == 1 ~ paste0(round(min(var3), 2), " to ", round(quantile(var3, 0.1), 2), " decile 1"),
ntile(var3, 10) == 2 ~ paste0(round(quantile(var3, 0.1), 2), " to ", round(quantile(var3, 0.2), 2), " decile 2"),
ntile(var3, 10) == 3 ~ paste0(round(quantile(var3, 0.2), 2), " to ", round(quantile(var3, 0.3), 2), " decile 3"),
ntile(var3, 10) == 4 ~ paste0(round(quantile(var3, 0.3), 2), " to ", round(quantile(var3, 0.4), 2), " decile 4"),
ntile(var3, 10) == 5 ~ paste0(round(quantile(var3, 0.4), 2), " to ", round(quantile(var3, 0.5), 2), " decile 5"),
ntile(var3, 10) == 6 ~ paste0(round(quantile(var3, 0.5), 2), " to ", round(quantile(var3, 0.6), 2), " decile 6"),
ntile(var3, 10) == 7 ~ paste0(round(quantile(var3, 0.6), 2), " to ", round(quantile(var3, 0.7), 2), " decile 7"),
ntile(var3, 10) == 8 ~ paste0(round(quantile(var3, 0.7), 2), " to ", round(quantile(var3, 0.8), 2), " decile 8"),
ntile(var3, 10) == 9 ~ paste0(round(quantile(var3, 0.8), 2), " to ", round(quantile(var3, 0.9), 2), " decile 9"),
ntile(var3, 10) == 10 ~ paste0(round(quantile(var3, 0.9), 2), " to ", round(max(var3), 2), " decile 10")))
Suppose now instead of by groups of "10" , I now want to do it by groups of "5" - I would need to manually change the above code. I am looking for a more convenient way of doing this.
I tried to write this function to do this:
percentile_classifier <- function(x, n_percentiles) {
# Calculate the percentiles
percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
# Create a character vector to store the labels
labels <- character(length(x))
# Loop through each percentile and assign the corresponding label to each element in the vector
for (i in 1:length(percentiles)) {
lower <- percentiles[i]
upper <- ifelse(i == length(percentiles), max(x), percentiles[i 1])
label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
labels[x >= lower & x < upper] <- label
}
# Return the labels
return(labels)
}
But when I try to call this function:
final <- final %>% mutate(class3 = percentile_classifier(var3, 20))
I am not sure if I have done this correctly.
Can someone please tell me if I have done this correctly?
Thanks!
CodePudding user response:
Your code is very close, but not quite right. This is clear if we run it on a nice small example instead of a 10,000 row data frame:
set.seed(42)
n = 15
var1 = rnorm(n, 100,100)
id = 1:n
final = data.frame(id, var1)
# final %>% mutate(result = percentile_classifier(var1, 5))
# id var1 result
# 1 1 237.09584 149.57 to 239.91 percentile 4
# 2 2 43.53018 -38.89 to 83.76 percentile 1
# 3 3 136.31284 92.45 to 149.57 percentile 3
# 4 4 163.28626 149.57 to 239.91 percentile 4
# 5 5 140.42683 92.45 to 149.57 percentile 3
# 6 6 89.38755 83.76 to 92.45 percentile 2
# 7 7 251.15220 239.91 to 328.66 percentile 5
# 8 8 90.53410 83.76 to 92.45 percentile 2
# 9 9 301.84237 239.91 to 328.66 percentile 5
# 10 10 93.72859 92.45 to 149.57 percentile 3
# 11 11 230.48697 149.57 to 239.91 percentile 4
# 12 12 328.66454 ## Problem - the max isn't labeled
# 13 13 -38.88607 -38.89 to 83.76 percentile 1
# 14 14 72.12112 -38.89 to 83.76 percentile 1
# 15 15 86.66787 83.76 to 92.45 percentile 2
We can see that the max value doesn't get a label. Knowing that's the problem, the bug is easy to find. Your function assigns the labels with labels[x >= lower & x < upper] <- label
, but of course the maximum value will never be strictly less than itself, the highest upper
value. You could fix the bug by treating it as a special case.
Instead, let's rely on the base function cut
which is R's built-in way to bin values like this. (See the ?cut
help page or the Stack Overflow FAQ How to bin numeric values for more details.)
cut
has a built-in argument for whether or not you want the endpoints included. It also defaults to intervals excluding the lowest point and including the highest, but that is customizable with the argument right
if you want to change it.
ptile <- function(x, n_percentiles) {
# Calculate the percentiles
pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
# Create a character vector to store the labels
labels <- sprintf("%.2f to %.2f percentile %d",
head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}
final %>% mutate(result = ptile(var1, 5))
# id var1 result
# 1 1 237.09584 149.57 to 239.91 percentile 4
# 2 2 43.53018 -38.89 to 83.76 percentile 1
# 3 3 136.31284 92.45 to 149.57 percentile 3
# 4 4 163.28626 149.57 to 239.91 percentile 4
# 5 5 140.42683 92.45 to 149.57 percentile 3
# 6 6 89.38755 83.76 to 92.45 percentile 2
# 7 7 251.15220 239.91 to 328.66 percentile 5
# 8 8 90.53410 83.76 to 92.45 percentile 2
# 9 9 301.84237 239.91 to 328.66 percentile 5
# 10 10 93.72859 92.45 to 149.57 percentile 3
# 11 11 230.48697 149.57 to 239.91 percentile 4
# 12 12 328.66454 239.91 to 328.66 percentile 5
# 13 13 -38.88607 -38.89 to 83.76 percentile 1
# 14 14 72.12112 -38.89 to 83.76 percentile 1
# 15 15 86.66787 83.76 to 92.45 percentile 2