Home > OS >  R: Creating a Function to Identify Arbitrary Percentiles
R: Creating a Function to Identify Arbitrary Percentiles

Time:12-27

I am working with the R programming language.

I have the following dataset:

library(dplyr)

var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
id = 1:10000

final = data.frame(id, var1, var2, var3, var4)

Suppose I want to create a new variable based on decile values of another variable (e.g. var3). I can use the following code to do this:

final = final %>%
  mutate(class3 = case_when(ntile(var3, 10) == 1 ~ paste0(round(min(var3), 2), " to ", round(quantile(var3, 0.1), 2), " decile 1"),
                            ntile(var3, 10) == 2 ~ paste0(round(quantile(var3, 0.1), 2), " to ", round(quantile(var3, 0.2), 2), " decile 2"),
                            ntile(var3, 10) == 3 ~ paste0(round(quantile(var3, 0.2), 2), " to ", round(quantile(var3, 0.3), 2), " decile 3"),
                            ntile(var3, 10) == 4 ~ paste0(round(quantile(var3, 0.3), 2), " to ", round(quantile(var3, 0.4), 2), " decile 4"),
                            ntile(var3, 10) == 5 ~ paste0(round(quantile(var3, 0.4), 2), " to ", round(quantile(var3, 0.5), 2), " decile 5"),
                            ntile(var3, 10) == 6 ~ paste0(round(quantile(var3, 0.5), 2), " to ", round(quantile(var3, 0.6), 2), " decile 6"),
                            ntile(var3, 10) == 7 ~ paste0(round(quantile(var3, 0.6), 2), " to ", round(quantile(var3, 0.7), 2), " decile 7"),
                            ntile(var3, 10) == 8 ~ paste0(round(quantile(var3, 0.7), 2), " to ", round(quantile(var3, 0.8), 2), " decile 8"),
                            ntile(var3, 10) == 9 ~ paste0(round(quantile(var3, 0.8), 2), " to ", round(quantile(var3, 0.9), 2), " decile 9"),
                            ntile(var3, 10) == 10 ~ paste0(round(quantile(var3, 0.9), 2), " to ", round(max(var3), 2), " decile 10")))

Suppose now instead of by groups of "10" , I now want to do it by groups of "5" - I would need to manually change the above code. I am looking for a more convenient way of doing this.

I tried to write this function to do this:

percentile_classifier <- function(x, n_percentiles) {
  # Calculate the percentiles
  percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- character(length(x))

  # Loop through each percentile and assign the corresponding label to each element in the vector
  for (i in 1:length(percentiles)) {
    lower <- percentiles[i]
    upper <- ifelse(i == length(percentiles), max(x), percentiles[i 1])
    label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
    labels[x >= lower & x < upper] <- label
  }

  # Return the labels
  return(labels)
}

But when I try to call this function:

final <- final %>% mutate(class3 = percentile_classifier(var3, 20))

I am not sure if I have done this correctly.

Can someone please tell me if I have done this correctly?

Thanks!

CodePudding user response:

Your code is very close, but not quite right. This is clear if we run it on a nice small example instead of a 10,000 row data frame:

set.seed(42)
n = 15
var1 = rnorm(n, 100,100)
id = 1:n

final = data.frame(id, var1)

# final %>% mutate(result = percentile_classifier(var1, 5))
#    id      var1                        result
# 1   1 237.09584 149.57 to 239.91 percentile 4
# 2   2  43.53018  -38.89 to 83.76 percentile 1
# 3   3 136.31284  92.45 to 149.57 percentile 3
# 4   4 163.28626 149.57 to 239.91 percentile 4
# 5   5 140.42683  92.45 to 149.57 percentile 3
# 6   6  89.38755   83.76 to 92.45 percentile 2
# 7   7 251.15220 239.91 to 328.66 percentile 5
# 8   8  90.53410   83.76 to 92.45 percentile 2
# 9   9 301.84237 239.91 to 328.66 percentile 5
# 10 10  93.72859  92.45 to 149.57 percentile 3
# 11 11 230.48697 149.57 to 239.91 percentile 4
# 12 12 328.66454                               ## Problem - the max isn't labeled
# 13 13 -38.88607  -38.89 to 83.76 percentile 1
# 14 14  72.12112  -38.89 to 83.76 percentile 1
# 15 15  86.66787   83.76 to 92.45 percentile 2

We can see that the max value doesn't get a label. Knowing that's the problem, the bug is easy to find. Your function assigns the labels with labels[x >= lower & x < upper] <- label, but of course the maximum value will never be strictly less than itself, the highest upper value. You could fix the bug by treating it as a special case.

Instead, let's rely on the base function cut which is R's built-in way to bin values like this. (See the ?cut help page or the Stack Overflow FAQ How to bin numeric values for more details.)

cut has a built-in argument for whether or not you want the endpoints included. It also defaults to intervals excluding the lowest point and including the highest, but that is customizable with the argument right if you want to change it.

ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))

  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

final %>% mutate(result = ptile(var1, 5))
#    id      var1                        result
# 1   1 237.09584 149.57 to 239.91 percentile 4
# 2   2  43.53018  -38.89 to 83.76 percentile 1
# 3   3 136.31284  92.45 to 149.57 percentile 3
# 4   4 163.28626 149.57 to 239.91 percentile 4
# 5   5 140.42683  92.45 to 149.57 percentile 3
# 6   6  89.38755   83.76 to 92.45 percentile 2
# 7   7 251.15220 239.91 to 328.66 percentile 5
# 8   8  90.53410   83.76 to 92.45 percentile 2
# 9   9 301.84237 239.91 to 328.66 percentile 5
# 10 10  93.72859  92.45 to 149.57 percentile 3
# 11 11 230.48697 149.57 to 239.91 percentile 4
# 12 12 328.66454 239.91 to 328.66 percentile 5
# 13 13 -38.88607  -38.89 to 83.76 percentile 1
# 14 14  72.12112  -38.89 to 83.76 percentile 1
# 15 15  86.66787   83.76 to 92.45 percentile 2
  •  Tags:  
  • r
  • Related