I am working with the R programming language.
In a previous question (R: Creating a Function to Identify Arbitrary Percentiles), I learned how to write a function that calculates "arbitrary" (i.e. user specified) percentiles for a given variable:
ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}
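To make the behaviour of ptile concrete, here is a minimal sketch on ten made-up values (the vector x is only an illustration and is not part of my data). With n_percentiles = 2 the values are split at the median into two labelled bins. As far as I can tell, cut() stops with an error when the quantile breaks are not unique, so this sketch assumes values without heavy ties.

```r
# ptile() exactly as defined above, repeated so the example runs on its own
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1 / n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Made-up example values (an assumption for illustration only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Split into two bins at the median (5.5)
bins <- ptile(x, 2)

levels(bins)  # the two range labels
table(bins)   # how many values fall into each bin
```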
Now, I want to learn how to apply different "versions" of this function (i.e. this function with different arguments) when different conditions are present.
For example, suppose I have this data frame:
library(dplyr)
set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))
my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))
In general:
For rows belonging to a unique combination of gender, status and country that has fewer than 250 rows, I want to apply the ptile function with n_percentiles = 2 to var1 and var2.
For rows belonging to a unique combination of gender, status and country that has more than 250 rows, I want to apply the ptile function with n_percentiles = 5 to var1 and var2.
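The two rules above can be sketched as a small helper that maps a group size to a number of percentiles. The name n_for_group and its cutoff argument are hypothetical. The rules do not say what happens to a group of exactly 250 rows, so this sketch assumes such a group falls into the 5-percentile branch. The toy data frame and the cutoff of 5 are made up so the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical helper: choose the number of percentiles from the group size.
# Assumption: a group of exactly `cutoff` rows gets 5 percentiles.
n_for_group <- function(group_size, cutoff = 250) {
  if (group_size < cutoff) 2 else 5
}

# Toy data (made up): one group with 3 rows, one group with 7 rows
toy <- data.frame(g = rep(c("small", "large"), times = c(3, 7)))

# One row per group, with its size and the number of percentiles it would get
sizes <- toy %>%
  group_by(g) %>%
  summarise(group_size = n(),
            n_percentiles = n_for_group(n(), cutoff = 5),
            .groups = "drop")

sizes
```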
I know how to do this "manually" - first I find out which combinations have more than 250 rows and which combinations have less than 250 rows:
# Named group_counts rather than summary, to avoid masking base::summary()
group_counts <- my_data %>%
  group_by(gender, status, country) %>%
  summarise(counts = n(), .groups = "drop") %>%
  arrange(desc(counts))
Then, I would isolate these rows into two separate datasets and apply the desired version of this function on each dataset:
# Groups with fewer than 250 rows: 2 percentiles
ex1 <- my_data %>%
  group_by(gender, status, country) %>%
  filter(n() < 250) %>%
  mutate(result1 = ptile(var1, 2), result2 = ptile(var2, 2))
# Groups with more than 250 rows: 5 percentiles
ex2 <- my_data %>%
  group_by(gender, status, country) %>%
  filter(n() > 250) %>%
  mutate(result1 = ptile(var1, 5), result2 = ptile(var2, 5))
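If a single data frame is needed afterwards, the two pieces could be stacked again with bind_rows(). The sketch below is a hedged illustration on made-up toy data: the names toy, small_part, large_part and the cutoff of 6 are assumptions chosen to keep the example small, and a group of exactly cutoff rows is assumed to go to the 5-percentile branch so that no rows are lost.

```r
library(dplyr)

# ptile() as defined earlier, repeated so the example runs on its own
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1 / n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Toy data (made up): a 4-row group and an 8-row group
toy <- data.frame(
  grp  = rep(c("small", "large"), times = c(4, 8)),
  var1 = c(1:4, 11:18),
  var2 = c(21:24, 31:38)
)
cutoff <- 6  # hypothetical stand-in for 250

# Groups below the cutoff: 2 percentiles per variable
small_part <- toy %>%
  group_by(grp) %>%
  filter(n() < cutoff) %>%
  mutate(result1 = ptile(var1, 2), result2 = ptile(var2, 2))

# Groups at or above the cutoff: 5 percentiles per variable
large_part <- toy %>%
  group_by(grp) %>%
  filter(n() >= cutoff) %>%
  mutate(result1 = ptile(var1, 5), result2 = ptile(var2, 5))

# Stack the two pieces back into one data frame
combined <- bind_rows(small_part, large_part) %>%
  ungroup()
```

Checking that nrow(combined) equals nrow(toy) is a quick way to confirm that no group was dropped by the two filters.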
Can someone please tell me - have I done this correctly?
Thanks!
CodePudding user response:
Your code looks fine to me. You can also merge both computations into a single pipeline with across().
my_data %>%
  group_by(gender, status, country) %>%
  mutate(across(c(var1, var2), ~ if (n() < 250) ptile(.x, 2) else ptile(.x, 5))) %>%
  ungroup()
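One thing to be aware of: without a .names argument, across() replaces var1 and var2 with the binned factors. If the original numeric columns should be kept, a variant along the following lines may help. This is a sketch on made-up toy data. The names toy and cutoff and the "result_{.col}" naming pattern are assumptions for illustration, and the cutoff of 20 only stands in for 250.

```r
library(dplyr)

# ptile() as defined in the question, repeated so the example runs on its own
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1 / n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Toy data (made up): a 10-row group and a 40-row group
set.seed(1)
toy <- data.frame(
  grp  = rep(c("small", "large"), times = c(10, 40)),
  var1 = rnorm(50),
  var2 = rnorm(50)
)
cutoff <- 20  # hypothetical stand-in for 250

out <- toy %>%
  group_by(grp) %>%
  mutate(across(c(var1, var2),
                ~ if (n() < cutoff) ptile(.x, 2) else ptile(.x, 5),
                .names = "result_{.col}")) %>%  # adds result_var1, result_var2
  ungroup()

names(out)
```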