I have a panel data set with id, year and numerical column A. I want to add a column B with a factor variable indicating to which quantile group each row belongs per year, with either "lowest 33%", "33-67%" or "highest 33%".
I played around with dplyrs group_by(year) and the quantile() function but this just gave me back the value of the quantile. I also found the function quantcut(), that is supposed to be doing something similar to what I need, but I have had no success.
Data:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
4, 4, 4, 5, 5, 5),
year = c(2000, 2005, 2010, 2000, 2005, 2010,
2000, 2005, 2010, 2000, 2005, 2010)
A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
1000, 0, 0, 1000, 7000))
This is what I tried last which didn't give me anything useful:
library(tidyverse)
library(gtools)
df <- df %>%
mutate(B = quantcut(A, probs=seq(0.33, 0.67, 1)))
CodePudding user response:
quantile
for the respective years using this code:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
4, 4, 4),
year = c(2000, 2005, 2010, 2000, 2005, 2010,
2000, 2005, 2010, 2000, 2005, 2010),
A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
1000, 0, 0, 1000, 7000))
library(dplyr)
df %>%
mutate(B = factor(findInterval(year, c(-Inf, quantile(year, probs=c(0.33, .67)), Inf)),
labels=c("lowest 33%","33-67%","highest 33%")))
#> id year A B
#> 1 1 2000 0 lowest 33%
#> 2 1 2005 3000 33-67%
#> 3 1 2010 1000 highest 33%
#> 4 2 2000 5000 lowest 33%
#> 5 2 2005 0 33-67%
#> 6 2 2010 2000 highest 33%
#> 7 3 2000 2000 lowest 33%
#> 8 3 2005 1000 33-67%
#> 9 3 2010 0 highest 33%
#> 10 4 2000 0 lowest 33%
#> 11 4 2005 1000 33-67%
#> 12 4 2010 7000 highest 33%
Created on 2022-07-30 by the reprex package (v2.0.1)
You could use the function findInterval
where you specify the labels
per quantile
like this:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3,
4, 4, 4),
year = c(2000, 2005, 2010, 2000, 2005, 2010,
2000, 2005, 2010, 2000, 2005, 2010),
A = c(0, 3000, 1000, 5000, 0, 2000, 2000,
1000, 0, 0, 1000, 7000))
library(dplyr)
df %>%
group_by(id) %>%
mutate(B = factor(findInterval(A, c(-Inf, quantile(A, probs=c(0.33, .67)), Inf)),
labels=c("lowest 33%","33-67%","highest 33%")))
#> # A tibble: 12 × 4
#> # Groups: id [4]
#> id year A B
#> <dbl> <dbl> <dbl> <fct>
#> 1 1 2000 0 lowest 33%
#> 2 1 2005 3000 highest 33%
#> 3 1 2010 1000 33-67%
#> 4 2 2000 5000 highest 33%
#> 5 2 2005 0 lowest 33%
#> 6 2 2010 2000 33-67%
#> 7 3 2000 2000 highest 33%
#> 8 3 2005 1000 33-67%
#> 9 3 2010 0 lowest 33%
#> 10 4 2000 0 lowest 33%
#> 11 4 2005 1000 33-67%
#> 12 4 2010 7000 highest 33%
Created on 2022-07-30 by the reprex package (v2.0.1)
CodePudding user response:
I think this may solve your problem, using dplyrs recode
you use findInterval
and simply recode the values.
library(dplyr)
df |>
group_by(year) |>
mutate(B = recode(findInterval(A, quantile(A, probs = seq(0.33, 1, 0.33))),
`0` = "Lowest 33%", `1` = "Lowest 33%", `2` = "33%-67%", `3` = "Highest 33%", .default = "Highest 33%"))
|> ungroup()
id year A B
<dbl> <dbl> <dbl> <chr>
1 1 2000 0 Lowest 33%
2 1 2005 3000 Highest 33%
3 1 2010 1000 Lowest 33%
4 2 2000 5000 Highest 33%
5 2 2005 0 Lowest 33%
6 2 2010 2000 33%-67%
7 3 2000 2000 33%-67%
8 3 2005 1000 33%-67%
9 3 2010 0 Lowest 33%
10 4 2000 0 Lowest 33%
11 4 2005 1000 33%-67%
12 4 2010 7000 Highest 33%