Home > front end >  How can I add a new grouping variable
How can I add a new grouping variable

Time:04-20

How can I add another variable as a grouping variable based on the date? Here is my data:

df <- tibble(id = c(rep(1, 8), rep(2, 3)),
             col = structure(c(19031, 19031, 19031, 19031, 19072, 19072, 19072, 19072, 19031, 19031, 
                               19031), class = "Date"),
             grouping_var = c(rep(1, 4), rep(2,4), rep(1, 3) ) )

and the column grouping_variable is the expected outcome I want to have. So Basically, I want to assign a grouping based on the date. If an id has more than one date in the col then the first date gets assigned a 1 and the second a 2, and if an id has only one date the it only gets a 1. A dpylr solution would be particularly great here.

Thanks

CodePudding user response:

You can group by id, then match the dates column against the sorted unique values of itself.

df %>% 
  group_by(id) %>%
  mutate(grouping_var = match(col, sort(unique(col))))
#> # A tibble: 11 x 3
#> # Groups:   id [2]
#>       id col        grouping_var
#>    <dbl> <date>            <int>
#>  1     1 2022-02-08            1
#>  2     1 2022-02-08            1
#>  3     1 2022-02-08            1
#>  4     1 2022-02-08            1
#>  5     1 2022-03-21            2
#>  6     1 2022-03-21            2
#>  7     1 2022-03-21            2
#>  8     1 2022-03-21            2
#>  9     2 2022-02-08            1
#> 10     2 2022-02-08            1
#> 11     2 2022-02-08            1

CodePudding user response:

You can use dense_rank() in dplyr, which is a wrapper of match(x, sort(unique(x))).

df %>%
  group_by(id) %>%
  mutate(grp_var = dense_rank(col))

# A tibble: 11 x 3
# Groups:   id [2]
      id col        grouping_var
   <dbl> <date>            <int>
 1     1 2022-02-08            1
 2     1 2022-02-08            1
 3     1 2022-02-08            1
 4     1 2022-02-08            1
 5     1 2022-03-21            2
 6     1 2022-03-21            2
 7     1 2022-03-21            2
 8     1 2022-03-21            2
 9     2 2022-02-08            1
10     2 2022-02-08            1
11     2 2022-02-08            1

CodePudding user response:

Using cumsum(duplicated(.)) in ave.

transform(df, grouping_var2=ave(as.numeric(col), id, 
                                FUN=\(x) cumsum(!duplicated(sort(x)))[order(x)]))
#    id        col grouping_var
# 1   1 2022-02-08            1
# 2   1 2022-02-08            1
# 3   1 2022-02-08            1
# 4   1 2022-02-08            1
# 5   1 2022-03-21            2
# 6   1 2022-03-21            2
# 7   1 2022-03-21            2
# 8   1 2022-03-21            2
# 9   2 2022-02-08            1
# 10  2 2022-02-08            1
# 11  2 2022-02-08            1

Alternatively using findInterval as @GKi suggests:

transform(df, grouping_var2=ave(as.numeric(col), id, 
                                FUN=\(x) findInterval(x, sort(unique(x)))))

For dplyr just use mutate instead of transform.

  • Related