Home > Back-end >  How to use case_when when doing quartiles in R?
How to use case_when when doing quartiles in R?

Time:05-27

If i have this tibble:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2011END"),
  date = c('31-12-2010','31-12-2011', '30-04-2010','31-07-2010','30-09-2010'),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "youtube"
  ),
  method = c("website",
             "phone",
             "website",
             "laptop",
             "phone"),
  values = c(1, NA, 1, 2, 3))

And then i have this dataframe which tells you which quantiles to create along with the rankings to be made from the ranks:

tibble(
  method = c(
    "phone",
    "phone",
    "phone",
    "website",
    "website",
    "website",
    "laptop",
    "laptop",
    "laptop"
  ), 
  rank = c(3,2,1,3,2,1,3,2,1), 
  tile_condition = c("lowest 25%", "25 to 50%", "more than 50%", 
                     "highest 25%", "25 to 50%", "less than 25%", 
                     "lowest 25%", "25 to 50%", "more than 50%")
)

How can i use a case_when statement to correctly allow myself to create a ranking column which is based on the quartile calculation from the values column in the first dataframe?

I'm trying to apply the quantiles from the other dataframe to create a ranking column in the original dataframe - stuck on how to use case_when for it.

CodePudding user response:

If I've understood your question correctly, you first have to create a table to compare against, as:

df_quants <- 
    df1 %>% 
    drop_na(values) %>% 
    group_by(method) %>% 
    summarize(quant25 = quantile(values, probs = 0.25), 
              quant50 = quantile(values, probs = 0.5), 
              quant75 = quantile(values, probs = 0.75), 
              quant100 = quantile(values, probs = 1))

Then, using a join and a case_when statement, you would arrive at:

df2 %>% 
    left_join(df_quants, by = 'method') %>% 
    mutate(tiles = 
        case_when(rank < quant25 ~ 'lowest 25%', 
                  rank < quant50 ~ '25 to 50%', 
                  rank < quant75 ~ 'more than 50%', 
                  rank >= quant75 ~ 'highest 25%'))

CodePudding user response:

I would do something like this:

set.seed(124)

left_join(
  df1[sample(1:5,1000, replace=T),] %>% 
    mutate(values=sample(c(df1$values,1:30),1000, replace=T)) %>% 
    group_by(method) %>% 
    mutate(q=as.double(cut(values,quantile(values,probs=seq(0,1,0.25), na.rm=T), labels=c(1:4), include.lowest=T))) %>% 
    ungroup(),
  df2 %>% mutate(q = list(1,2,c(3,4),4,c(2,3),1,1,2,c(3,4))) %>% unnest(q),
  by=c("method", "q")
) %>% select(-q)

Output:

# A tibble: 1,000 × 7
   period  date       website  method  values  rank tile_condition
   <chr>   <chr>      <chr>    <chr>    <dbl> <dbl> <chr>         
 1 2010END 31-12-2010 google   website      7     2 25 to 75%     
 2 2011END 31-12-2011 google   phone       18     1 more than 50% 
 3 2010Q1  30-04-2010 facebook website     21     2 25 to 75%     
 4 2011END 30-09-2010 youtube  phone       15     1 more than 50% 
 5 2011END 30-09-2010 youtube  phone       26     1 more than 50% 
 6 2011END 31-12-2011 google   phone        3     3 lowest 25%    
 7 2010END 31-12-2010 google   website      1     1 less than 25% 
 8 2010Q1  30-04-2010 facebook website      2     1 less than 25% 
 9 2010Q2  31-07-2010 facebook laptop      14     2 25 to 50%     
10 2010Q2  31-07-2010 facebook laptop      16     1 more than 50% 
# … with 990 more rows

Notice that I updated your input to 1000 rows, and random new values for the purposes of illustration. Also, notice that I fixed df2, so that method website covers the full range of values. In your example the 50% to 75% quartile is missing.

Adjusted df2 input:

structure(list(method = c("phone", "phone", "phone", "website", 
"website", "website", "laptop", "laptop", "laptop"), rank = c(3, 
2, 1, 3, 2, 1, 3, 2, 1), tile_condition = c("lowest 25%", "25 to 50%", 
"more than 50%", "highest 25%", "25 to 75%", "less than 25%", 
"lowest 25%", "25 to 50%", "more than 50%")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L))
  • Related