How can I add another variable as a grouping variable based on the date? Here is my data:
df <- tibble(id = c(rep(1, 8), rep(2, 3)),
col = structure(c(19031, 19031, 19031, 19031, 19072, 19072, 19072, 19072, 19031, 19031,
19031), class = "Date"),
grouping_var = c(rep(1, 4), rep(2,4), rep(1, 3) ) )
and the column grouping_variable
is the expected outcome I want to have. So Basically, I want to assign a grouping based on the date. If an id
has more than one date in the col
then the first date gets assigned a 1
and the second a 2
, and if an id has only one date the it only gets a 1
. A dpylr solution would be particularly great here.
Thanks
CodePudding user response:
You can group by id
, then match
the dates column against the sorted unique
values of itself.
df %>%
group_by(id) %>%
mutate(grouping_var = match(col, sort(unique(col))))
#> # A tibble: 11 x 3
#> # Groups: id [2]
#> id col grouping_var
#> <dbl> <date> <int>
#> 1 1 2022-02-08 1
#> 2 1 2022-02-08 1
#> 3 1 2022-02-08 1
#> 4 1 2022-02-08 1
#> 5 1 2022-03-21 2
#> 6 1 2022-03-21 2
#> 7 1 2022-03-21 2
#> 8 1 2022-03-21 2
#> 9 2 2022-02-08 1
#> 10 2 2022-02-08 1
#> 11 2 2022-02-08 1
CodePudding user response:
You can use dense_rank()
in dplyr
, which is a wrapper of match(x, sort(unique(x)))
.
df %>%
group_by(id) %>%
mutate(grp_var = dense_rank(col))
# A tibble: 11 x 3
# Groups: id [2]
id col grouping_var
<dbl> <date> <int>
1 1 2022-02-08 1
2 1 2022-02-08 1
3 1 2022-02-08 1
4 1 2022-02-08 1
5 1 2022-03-21 2
6 1 2022-03-21 2
7 1 2022-03-21 2
8 1 2022-03-21 2
9 2 2022-02-08 1
10 2 2022-02-08 1
11 2 2022-02-08 1
CodePudding user response:
Using cumsum(duplicated(.))
in ave
.
transform(df, grouping_var2=ave(as.numeric(col), id,
FUN=\(x) cumsum(!duplicated(sort(x)))[order(x)]))
# id col grouping_var
# 1 1 2022-02-08 1
# 2 1 2022-02-08 1
# 3 1 2022-02-08 1
# 4 1 2022-02-08 1
# 5 1 2022-03-21 2
# 6 1 2022-03-21 2
# 7 1 2022-03-21 2
# 8 1 2022-03-21 2
# 9 2 2022-02-08 1
# 10 2 2022-02-08 1
# 11 2 2022-02-08 1
Alternatively using findInterval
as @GKi
suggests:
transform(df, grouping_var2=ave(as.numeric(col), id,
FUN=\(x) findInterval(x, sort(unique(x)))))
For dplyr
just use mutate
instead of transform
.