R: Difficulty Calculating Percentiles?-CodePudding

I am working with the R programming language.

I have the following dataset:

library(dplyr)

set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))

my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))

I then have this function used to calculate arbitrary percentiles for variables :

# source: https://stackoverflow.com/questions/74947154/r-using-dplyr-to-perform-conditional-functions
ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))

  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

When I use this function sometimes:

# error not produced on this dataset, but on other datasets

na.omit(my_data) %>%
    group_by(gender, status, country) %>%
    mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))

I get one of two errors:

Error in cut.default(x, breaks = pct, lables = labels, include.lowest = TRUE): invalid number of intervals
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique

At first I thought that these errors are being produced because I am using this function on "groups of rows" - and in some of these rows, there might be too few rows for the desired percentile to be calculated?

I had originally thought that perhaps I could fix this problem by excluding "groups" with an insufficient number of rows:

na.omit(my_data) %>%
    group_by(gender, status, country) %>%
filter(n() < 5) %>%
    mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))

But the same errors still persist.

I was wondering - is there some way to modify this percentile function such that when percentiles might not be able to be calculated at the desired level, the next closest level of percentiles can be calculated?

As an example, if I want percentiles at groups of 10 and this is not possible - perhaps percentiles at groups of 15 or groups of 20 might be possible?

As another example, suppose some group of observations (e.g. male, immigrant, country A) only has 1 observation and I want percentiles in groups of 10 - naturally, it seems that this is not possible. Without knowing that such a group exists in advance, is it possible to modify this ptile function such that it either ignores this group or just calculates the closest possible percentile (e.g. places everything into 1)?

In general, how can I change this ptile function so that these errors can be fixed?

Can someone please suggest a way to do this?

Thanks!

CodePudding user response：

This is not really an answer, but I am quite sure it will help.

First I thought this is the reason:

The reason is that in var 1 and var 2 length of break and label is not equal:

pct_var1 <- quantile(my_data$var1, probs = seq(0, 1, 1/10))
length(pct_var1)
# Create a character vector to store the labels
labels_var1 <- sprintf("%.2f to %.2f percentile %d",
                  head(pct, -1), tail(pct, -1), seq_len(10))
length(labels_var1)
cut_var1 <- cut(my_data$var1, breaks = pct, labels = labels, include.lowest = TRUE)


pct_var2 <- quantile(my_data$var2, probs = seq(0, 1, 1/5))
length(pct_var2)
# Create a character vector to store the labels
labels_var2 <- sprintf("%.2f to %.2f percentile %d",
                  head(pct, -1), tail(pct, -1), seq_len(5))
length(labels_var2)
cut_var2 <- cut(my_data$var2, breaks = pct, labels = labels, include.lowest = TRUE)

> length(pct_var1)
[1] 11
> length(labels_var1)
[1] 10
> length(pct_var2)
[1] 6               
> length(labels_var2)
[1] 5

In combination with this post Cut and labels/breaks length conflict it should be manageable to solve.

But then I encountered that your second data frame example:

na.omit(my_data) %>%
  group_by(gender, status, country) %>%
  filter(n() < 5)

removes all data

CodePudding user response：

Various attempts to solve this problem on my own:

Attempt 1:

final_answer = my_data %>% group_by(gender, status, country) %>% 
  mutate(result1 = case_when(ntile(var1, 10) == 1 ~ paste0(round(min(var1), 2), " to ", round(quantile(var1, 0.1), 2), " group 1"),
                            ntile(var1, 10) == 2 ~ paste0(round(quantile(var1, 0.1), 2), " to ", round(quantile(var1, 0.2), 2), " group 2"),
                            ntile(var1, 10) == 3 ~ paste0(round(quantile(var1, 0.2), 2), " to ", round(quantile(var1, 0.3), 2), " group 3"),
                            ntile(var1, 10) == 4 ~ paste0(round(quantile(var1, 0.3), 2), " to ", round(quantile(var1, 0.4), 2), " group 4"),
                            ntile(var1, 10) == 5 ~ paste0(round(quantile(var1, 0.4), 2), " to ", round(quantile(var1, 0.5), 2), " group 5"),
                            ntile(var1, 10) == 6 ~ paste0(round(quantile(var1, 0.5), 2), " to ", round(quantile(var1, 0.6), 2), " group 6"),
                            ntile(var1, 10) == 7 ~ paste0(round(quantile(var1, 0.6), 2), " to ", round(quantile(var1, 0.7), 2), " group 7"),
                            ntile(var1, 10) == 8 ~ paste0(round(quantile(var1, 0.7), 2), " to ", round(quantile(var1, 0.8), 2), " group 8"),
                            ntile(var1, 10) == 9 ~ paste0(round(quantile(var1, 0.8), 2), " to ", round(quantile(var1, 0.9), 2), " group 9"),
                            ntile(var1, 10) == 10 ~ paste0(round(quantile(var1, 0.9), 2), " to ", round(max(var1), 2), " group 10"))) %>% 
  mutate(result2 = case_when(ntile(var2, 10) == 1 ~ paste0(round(min(var2), 2), " to ", round(quantile(var2, 0.1), 2), " group 1"),
                            ntile(var2, 10) == 2 ~ paste0(round(quantile(var2, 0.1), 2), " to ", round(quantile(var2, 0.2), 2), " group 2"),
                            ntile(var2, 10) == 3 ~ paste0(round(quantile(var2, 0.2), 2), " to ", round(quantile(var2, 0.3), 2), " group 3"),
                            ntile(var2, 10) == 4 ~ paste0(round(quantile(var2, 0.3), 2), " to ", round(quantile(var2, 0.4), 2), " group 4"),
                            ntile(var2, 10) == 5 ~ paste0(round(quantile(var2, 0.4), 2), " to ", round(quantile(var2, 0.5), 2), " group 5"))

Attempt 2:

percentile_classifier <- function(x, n_percentiles) {
  # Calculate the percentiles
  percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- character(length(x))

  # Loop through each percentile and assign the corresponding label to each element in the vector
  for (i in 1:length(percentiles)) {
    lower <- percentiles[i]
    upper <- ifelse(i == length(percentiles), max(x), percentiles[i 1])
    label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
    labels[x >= lower & x < upper] <- label
  }

  # Return the labels
  return(labels)
}

final <- my_data %>% group_by(gender, status, country) %>% mutate(result1 = percentile_classifier(var1, 10))  %>% mutate(result2 = percentile_classifier(var2, 5))

CodePudding user response：

If we need to bypass some of these errors i.e. when the number of elements in the group are less, an option is to use tryCatch or purrr::possibly

library(dplyr)
library(purrr)
f_ptile <- possibly(ptile, otherwise = factor(NA_character_))

-testing

na.omit(my_data) %>%
     group_by(gender, status, country) %>%
     slice(1) %>% # this fails with original ptile function
     mutate(result1 = f_ptile(var1, 10), result2 = f_ptile(var2, 5))
# A tibble: 16 × 8
# Groups:   gender, status, country [16]
   gender status    disease country    var1   var2 result1 result2
   <fct>  <fct>     <fct>   <fct>     <dbl>  <dbl> <fct>   <fct>  
 1 Female Citizen   No      A       11902.  -2436. <NA>    <NA>   
 2 Female Citizen   Yes     B        3508.   6851. <NA>    <NA>   
 3 Female Citizen   Yes     C       16854.  -1769. <NA>    <NA>   
 4 Female Citizen   No      D       12372.   4363. <NA>    <NA>   
 5 Female Immigrant Yes     A        9635.    695. <NA>    <NA>   
 6 Female Immigrant No      B        3120.  -2674. <NA>    <NA>   
 7 Female Immigrant No      C       -1635.   3971. <NA>    <NA>   
 8 Female Immigrant No      D       -3163.   5802. <NA>    <NA>   
 9 Male   Citizen   No      A        3836.   8196. <NA>    <NA>   
10 Male   Citizen   No      B        6125.   8096. <NA>    <NA>   
11 Male   Citizen   Yes     C        2159.   9863. <NA>    <NA>   
12 Male   Citizen   No      D        3622.   8159. <NA>    <NA>   
13 Male   Immigrant No      A       -3003.   6874. <NA>    <NA>   
14 Male   Immigrant Yes     B         -27.1  4063. <NA>    <NA>   
15 Male   Immigrant No      C        4166.   2103. <NA>    <NA>   
16 Male   Immigrant No      D        5718.  17866. <NA>    <NA>