R: Using dplyr to Perform "Conditional" Functions

Time:12-29

I am working with the R programming language.

In a previous question (R: Creating a Function to Identify Arbitrary Percentiles), I learned how to write a function that calculates "arbitrary" (i.e. user-specified) percentiles for a given variable:

ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))

  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}
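
As a quick check of what ptile returns, here is a minimal sketch on a toy vector. The input 1:10 is only an illustrative assumption, and the function is repeated so the snippet runs on its own:

```r
# ptile() as defined above, repeated so this snippet is self-contained
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Toy input (illustrative only): split 1:10 into two bins at the median
halves <- ptile(1:10, 2)

levels(halves)
# "1.00 to 5.50 percentile 1"  "5.50 to 10.00 percentile 2"

table(halves)  # 5 values fall in each bin
```

Keep in mind that cut() requires unique breaks, so a variable with many tied values could produce duplicate quantiles and make the call fail.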

Now, I want to learn how to apply different "versions" of this function (i.e. this function with different arguments) when different conditions are present.

For example, suppose I have this data frame:

library(dplyr)

set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))

my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))

In general:

  • For each unique combination of gender, status and country that has fewer than 250 rows, I want to apply the "ptile" function with n_percentiles = 2 to var1 and var2

  • For each unique combination of gender, status and country that has more than 250 rows, I want to apply the "ptile" function with n_percentiles = 5 to var1 and var2

I know how to do this "manually" - first I find out which combinations have more than 250 rows and which combinations have less than 250 rows:

summary = my_data %>%
  group_by(gender, status, country) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))

Then, I would isolate these rows into two separate datasets and apply the desired version of this function on each dataset:

ex1 = my_data %>%
    group_by(gender, status, country) %>%
    filter(n() < 250) %>%
    mutate(result1 = ptile(var1, 2), result2 = ptile(var2, 2))

ex2 = my_data %>%
    group_by(gender, status, country) %>%
    filter(n() > 250) %>%
    mutate(result1 = ptile(var1, 5), result2 = ptile(var2, 5))
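
One thing worth checking with this kind of split: a combination with exactly 250 rows satisfies neither n() < 250 nor n() > 250, so it would end up in neither ex1 nor ex2. A minimal sketch on toy data shows the effect, and how >= would close the gap if that is what is intended. The group names, group sizes and the cutoff of 5 are made up for illustration:

```r
library(dplyr)

# Toy data (illustrative only): groups of size 3, 5 and 7
toy <- data.frame(g = rep(c("a", "b", "c"), times = c(3, 5, 7)))
cutoff <- 5  # stands in for 250

below       <- toy %>% group_by(g) %>% filter(n() <  cutoff) %>% ungroup()
above       <- toy %>% group_by(g) %>% filter(n() >  cutoff) %>% ungroup()
at_or_above <- toy %>% group_by(g) %>% filter(n() >= cutoff) %>% ungroup()

nrow(toy)                        # 15
nrow(below) + nrow(above)        # 10: group "b" (exactly 5 rows) is dropped
nrow(below) + nrow(at_or_above)  # 15: every row is covered
```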

Can someone please tell me - have I done this correctly?

Thanks!

CodePudding user response:

Your code looks fine to me. You can also merge the two computations into one pipeline with across().

my_data %>%
  group_by(gender, status, country) %>%
  mutate(across(c(var1, var2), ~ if(n() < 250) ptile(.x, 2) else ptile(.x, 5))) %>%
  ungroup()
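
Note that across() used this way replaces var1 and var2 with the percentile labels. If the original columns should be kept, one option is the .names argument of across(). The sketch below assumes toy data with one small and one large group; grp and the _ptile suffix are illustrative names, not from the question:

```r
library(dplyr)

# ptile() from the question, repeated so this snippet is self-contained
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Toy data (illustrative only): one group below and one above the 250-row cutoff
set.seed(123)
toy <- data.frame(
  grp  = rep(c("small", "large"), times = c(40, 400)),
  var1 = rnorm(440, 5000, 5000),
  var2 = rnorm(440, 5000, 5000)
)

out <- toy %>%
  group_by(grp) %>%
  mutate(across(c(var1, var2),
                ~ if (n() < 250) ptile(.x, 2) else ptile(.x, 5),
                .names = "{.col}_ptile")) %>%  # write to new columns, keep originals
  ungroup()

names(out)  # grp, var1, var2, var1_ptile, var2_ptile
```

Because the else branch handles every group with 250 rows or more, this version also covers a group of exactly 250 rows.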