Mutate new column to create factor variable based on quantiles of other column, by group, in R-CodePudding

I have a panel data set with id, year and numerical column A. I want to add a column B with a factor variable indicating to which quantile group each row belongs per year, with either "lowest 33%", "33-67%" or "highest 33%".

I played around with dplyrs group_by(year) and the quantile() function but this just gave me back the value of the quantile. I also found the function quantcut(), that is supposed to be doing something similar to what I need, but I have had no success.

Data:

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                          4, 4, 4, 5, 5, 5),
                 year = c(2000, 2005, 2010, 2000, 2005, 2010,
                          2000, 2005, 2010, 2000, 2005, 2010)
                 A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
                          1000, 0, 0, 1000, 7000))

This is what I tried last which didn't give me anything useful:

library(tidyverse)
library(gtools)
df <- df %>% 
  mutate(B = quantcut(A, probs=seq(0.33, 0.67, 1)))

CodePudding user response：

quantile for the respective years using this code:

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                        4, 4, 4),
                 year = c(2000, 2005, 2010, 2000, 2005, 2010,
                          2000, 2005, 2010, 2000, 2005, 2010),
                 A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
                       1000, 0, 0, 1000, 7000))

library(dplyr)

df %>%
  mutate(B = factor(findInterval(year, c(-Inf, quantile(year, probs=c(0.33, .67)), Inf)), 
                    labels=c("lowest 33%","33-67%","highest 33%")))
#>    id year    A           B
#> 1   1 2000    0  lowest 33%
#> 2   1 2005 3000      33-67%
#> 3   1 2010 1000 highest 33%
#> 4   2 2000 5000  lowest 33%
#> 5   2 2005    0      33-67%
#> 6   2 2010 2000 highest 33%
#> 7   3 2000 2000  lowest 33%
#> 8   3 2005 1000      33-67%
#> 9   3 2010    0 highest 33%
#> 10  4 2000    0  lowest 33%
#> 11  4 2005 1000      33-67%
#> 12  4 2010 7000 highest 33%

^{Created on 2022-07-30 by the reprex package (v2.0.1)}

You could use the function findInterval where you specify the labels per quantile like this:

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
                        4, 4, 4),
                 year = c(2000, 2005, 2010, 2000, 2005, 2010,
                          2000, 2005, 2010, 2000, 2005, 2010),
                 A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
                       1000, 0, 0, 1000, 7000))

library(dplyr)
df %>%
  group_by(id) %>%
  mutate(B = factor(findInterval(A, c(-Inf, quantile(A, probs=c(0.33, .67)), Inf)), 
                    labels=c("lowest 33%","33-67%","highest 33%")))
#> # A tibble: 12 × 4
#> # Groups:   id [4]
#>       id  year     A B          
#>    <dbl> <dbl> <dbl> <fct>      
#>  1     1  2000     0 lowest 33% 
#>  2     1  2005  3000 highest 33%
#>  3     1  2010  1000 33-67%     
#>  4     2  2000  5000 highest 33%
#>  5     2  2005     0 lowest 33% 
#>  6     2  2010  2000 33-67%     
#>  7     3  2000  2000 highest 33%
#>  8     3  2005  1000 33-67%     
#>  9     3  2010     0 lowest 33% 
#> 10     4  2000     0 lowest 33% 
#> 11     4  2005  1000 33-67%     
#> 12     4  2010  7000 highest 33%

^{Created on 2022-07-30 by the reprex package (v2.0.1)}

CodePudding user response：

I think this may solve your problem, using dplyrs recode you use findInterval and simply recode the values.

library(dplyr)

df |> 
  group_by(year) |>
  mutate(B = recode(findInterval(A, quantile(A, probs = seq(0.33, 1, 0.33))), 
`0` = "Lowest 33%", `1` = "Lowest 33%", `2` = "33%-67%", `3` = "Highest 33%", .default = "Highest 33%"))
|> ungroup()

     id  year     A B          
   <dbl> <dbl> <dbl> <chr>      
 1     1  2000     0 Lowest 33% 
 2     1  2005  3000 Highest 33%
 3     1  2010  1000 Lowest 33% 
 4     2  2000  5000 Highest 33%
 5     2  2005     0 Lowest 33% 
 6     2  2010  2000 33%-67%    
 7     3  2000  2000 33%-67%    
 8     3  2005  1000 33%-67%    
 9     3  2010     0 Lowest 33% 
10     4  2000     0 Lowest 33% 
11     4  2005  1000 33%-67%    
12     4  2010  7000 Highest 33%