I am working with the R programming language.
I have the following dataset:
library(dplyr)
set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))
my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))
I then have this function used to calculate arbitrary percentiles for variables :
# source: https://stackoverflow.com/questions/74947154/r-using-dplyr-to-perform-conditional-functions
ptile <- function(x, n_percentiles) {
# Calculate the percentiles
pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
# Create a character vector to store the labels
labels <- sprintf("%.2f to %.2f percentile %d",
head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}
When I use this function sometimes:
# error not produced on this dataset, but on other datasets
na.omit(my_data) %>%
group_by(gender, status, country) %>%
mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))
I get one of two errors:
Error in cut.default(x, breaks = pct, lables = labels, include.lowest = TRUE): invalid number of intervals
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique
At first I thought that these errors are being produced because I am using this function on "groups of rows" - and in some of these rows, there might be too few rows for the desired percentile to be calculated?
I had originally thought that perhaps I could fix this problem by excluding "groups" with an insufficient number of rows:
na.omit(my_data) %>%
group_by(gender, status, country) %>%
filter(n() < 5) %>%
mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))
But the same errors still persist.
I was wondering - is there some way to modify this percentile function such that when percentiles might not be able to be calculated at the desired level, the next closest level of percentiles can be calculated?
As an example, if I want percentiles at groups of 10 and this is not possible - perhaps percentiles at groups of 15 or groups of 20 might be possible?
As another example, suppose some group of observations (e.g. male, immigrant, country A) only has 1 observation and I want percentiles in groups of 10 - naturally, it seems that this is not possible. Without knowing that such a group exists in advance, is it possible to modify this ptile function such that it either ignores this group or just calculates the closest possible percentile (e.g. places everything into 1)?
In general, how can I change this ptile function so that these errors can be fixed?
Can someone please suggest a way to do this?
Thanks!
CodePudding user response:
This is not really an answer, but I am quite sure it will help.
First I thought this is the reason:
The reason is that in var 1 and var 2 length of break and label is not equal:
pct_var1 <- quantile(my_data$var1, probs = seq(0, 1, 1/10))
length(pct_var1)
# Create a character vector to store the labels
labels_var1 <- sprintf("%.2f to %.2f percentile %d",
head(pct, -1), tail(pct, -1), seq_len(10))
length(labels_var1)
cut_var1 <- cut(my_data$var1, breaks = pct, labels = labels, include.lowest = TRUE)
pct_var2 <- quantile(my_data$var2, probs = seq(0, 1, 1/5))
length(pct_var2)
# Create a character vector to store the labels
labels_var2 <- sprintf("%.2f to %.2f percentile %d",
head(pct, -1), tail(pct, -1), seq_len(5))
length(labels_var2)
cut_var2 <- cut(my_data$var2, breaks = pct, labels = labels, include.lowest = TRUE)
> length(pct_var1)
[1] 11
> length(labels_var1)
[1] 10
> length(pct_var2)
[1] 6
> length(labels_var2)
[1] 5
In combination with this post Cut and labels/breaks length conflict it should be manageable to solve.
But then I encountered that your second data frame example:
na.omit(my_data) %>%
group_by(gender, status, country) %>%
filter(n() < 5)
removes all data
CodePudding user response:
Various attempts to solve this problem on my own:
Attempt 1:
final_answer = my_data %>% group_by(gender, status, country) %>%
mutate(result1 = case_when(ntile(var1, 10) == 1 ~ paste0(round(min(var1), 2), " to ", round(quantile(var1, 0.1), 2), " group 1"),
ntile(var1, 10) == 2 ~ paste0(round(quantile(var1, 0.1), 2), " to ", round(quantile(var1, 0.2), 2), " group 2"),
ntile(var1, 10) == 3 ~ paste0(round(quantile(var1, 0.2), 2), " to ", round(quantile(var1, 0.3), 2), " group 3"),
ntile(var1, 10) == 4 ~ paste0(round(quantile(var1, 0.3), 2), " to ", round(quantile(var1, 0.4), 2), " group 4"),
ntile(var1, 10) == 5 ~ paste0(round(quantile(var1, 0.4), 2), " to ", round(quantile(var1, 0.5), 2), " group 5"),
ntile(var1, 10) == 6 ~ paste0(round(quantile(var1, 0.5), 2), " to ", round(quantile(var1, 0.6), 2), " group 6"),
ntile(var1, 10) == 7 ~ paste0(round(quantile(var1, 0.6), 2), " to ", round(quantile(var1, 0.7), 2), " group 7"),
ntile(var1, 10) == 8 ~ paste0(round(quantile(var1, 0.7), 2), " to ", round(quantile(var1, 0.8), 2), " group 8"),
ntile(var1, 10) == 9 ~ paste0(round(quantile(var1, 0.8), 2), " to ", round(quantile(var1, 0.9), 2), " group 9"),
ntile(var1, 10) == 10 ~ paste0(round(quantile(var1, 0.9), 2), " to ", round(max(var1), 2), " group 10"))) %>%
mutate(result2 = case_when(ntile(var2, 10) == 1 ~ paste0(round(min(var2), 2), " to ", round(quantile(var2, 0.1), 2), " group 1"),
ntile(var2, 10) == 2 ~ paste0(round(quantile(var2, 0.1), 2), " to ", round(quantile(var2, 0.2), 2), " group 2"),
ntile(var2, 10) == 3 ~ paste0(round(quantile(var2, 0.2), 2), " to ", round(quantile(var2, 0.3), 2), " group 3"),
ntile(var2, 10) == 4 ~ paste0(round(quantile(var2, 0.3), 2), " to ", round(quantile(var2, 0.4), 2), " group 4"),
ntile(var2, 10) == 5 ~ paste0(round(quantile(var2, 0.4), 2), " to ", round(quantile(var2, 0.5), 2), " group 5"))
Attempt 2:
percentile_classifier <- function(x, n_percentiles) {
# Calculate the percentiles
percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
# Create a character vector to store the labels
labels <- character(length(x))
# Loop through each percentile and assign the corresponding label to each element in the vector
for (i in 1:length(percentiles)) {
lower <- percentiles[i]
upper <- ifelse(i == length(percentiles), max(x), percentiles[i 1])
label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
labels[x >= lower & x < upper] <- label
}
# Return the labels
return(labels)
}
final <- my_data %>% group_by(gender, status, country) %>% mutate(result1 = percentile_classifier(var1, 10)) %>% mutate(result2 = percentile_classifier(var2, 5))
CodePudding user response:
If we need to bypass some of these errors i.e. when the number of elements in the group are less, an option is to use tryCatch
or purrr::possibly
library(dplyr)
library(purrr)
f_ptile <- possibly(ptile, otherwise = factor(NA_character_))
-testing
na.omit(my_data) %>%
group_by(gender, status, country) %>%
slice(1) %>% # this fails with original ptile function
mutate(result1 = f_ptile(var1, 10), result2 = f_ptile(var2, 5))
# A tibble: 16 × 8
# Groups: gender, status, country [16]
gender status disease country var1 var2 result1 result2
<fct> <fct> <fct> <fct> <dbl> <dbl> <fct> <fct>
1 Female Citizen No A 11902. -2436. <NA> <NA>
2 Female Citizen Yes B 3508. 6851. <NA> <NA>
3 Female Citizen Yes C 16854. -1769. <NA> <NA>
4 Female Citizen No D 12372. 4363. <NA> <NA>
5 Female Immigrant Yes A 9635. 695. <NA> <NA>
6 Female Immigrant No B 3120. -2674. <NA> <NA>
7 Female Immigrant No C -1635. 3971. <NA> <NA>
8 Female Immigrant No D -3163. 5802. <NA> <NA>
9 Male Citizen No A 3836. 8196. <NA> <NA>
10 Male Citizen No B 6125. 8096. <NA> <NA>
11 Male Citizen Yes C 2159. 9863. <NA> <NA>
12 Male Citizen No D 3622. 8159. <NA> <NA>
13 Male Immigrant No A -3003. 6874. <NA> <NA>
14 Male Immigrant Yes B -27.1 4063. <NA> <NA>
15 Male Immigrant No C 4166. 2103. <NA> <NA>
16 Male Immigrant No D 5718. 17866. <NA> <NA>