Home > other >  Calculating Quartiles with multiple groups in R?
Calculating Quartiles with multiple groups in R?

Time:05-28

I have the following tibble of data:

df1 <- tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2011END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "youtube"
  ),
  method = c("website",
             "phone",
             "website",
             "laptop",
             "phone"),
  values = c(1, NA, 1, 2, 3))

And a tibble of qaurtile ranks:

df2 <-
  structure(
    list(
      method = c(
        "phone",
        "phone",
        "phone",
        "website",
        "website",
        "website",
        "laptop",
        "laptop",
        "laptop"
      ),
      rank = c(3,
               2, 1, 3, 2, 1, 3, 2, 1),
      tile_condition = c(
        "lowest 25%",
        "25 to 50%",
        "more than 50%",
        "highest 25%",
        "25 to 75%",
        "less than 25%",
        "lowest 25%",
        "25 to 50%",
        "more than 50%"
      )
    ),
    class = c("tbl_df",
              "tbl", "data.frame"),
    row.names = c(NA, -9L)
  )

I'm trying to find a way of finding of creating a dataframe that has the ranks like in df2 based on the values in df1

However I would like to calculate the quartiles with this grouping:

df1 %>% group_by(period, method)

So that the quartiles are created against the grouping of period and method e.g. for the period 2010END and the methods - find the quartiles for those values. Then left join the ranking scores from df2 there so we can tell where those values lie in the quartiles

Is there a way to do this in tidyverse?

CodePudding user response:

Your sample df1 is too small to demonstrate the solution so this creates an artificial dataset structured like yours.

library(data.table)
##
#   this creates an artificial dataset...
#   with enough data to demonstrate the solution
#
df1 <- data.table(period=c('2010END', '2011END'))
df1 <- df1[, .(method=c('phone', 'website', 'laptop')), by=.(period)]
df1 <- df1[, .(values=rpois(sample(50:100, 1), 10)), by=.(period, method)]
##
#   you start here...
#
setDT(df1)   # with your df1 will need this to convert to data.table
labs <- c('q1', 'q2', 'q3', 'q4')
df1[, tile_condition:=cut(values, breaks= quantile(values, probs = c(0, .25, 0.50, 0.75, 1.0)), 
                          labels=labs, include.lowest = TRUE)
    , by=.(period, method)]

CodePudding user response:

A tidyverse solution :

## 1.We group the data and create bins with cut function (3 breaks as requested)
## 2.We transform the intervals (factor data) to numbers with dense_rank function
## 3.We recover the description of each rank from the 2nd dataframe

df1 %>% 
  group_by(period,method) %>%
  mutate(q=cut(values,
                breaks = unique(quantile(values, probs = c(0,.25, 0.50,1))),
                include.lowest = TRUE)) %>% 
  arrange(q) %>% 
  mutate(rank=dense_rank(q)) %>% 
  left_join(y = df2,by = c("method","rank"))
  • Related