Home > front end >  create a subgrouping column based on values in two other columns
create a subgrouping column based on values in two other columns

Time:11-09

I have the following dataframe:


df = data.frame(date  = c("2021-11-17", "2021-11-17", "2022-05-11", "2022-05-11", "2021-11-04", "2021-11-04", "2022-06-09", "2022-06-09"),
                participant_id = rep(c("sub-x", "sub-y"), each = 4))

In that way, sub-x has 4 rows associated (2 different dates and 2 rows per date) and sub-y too (the 4 last dates).

I would like to create a new column called "session" in which there would be, for each participant, either 1 or 2 depending on whether we are dealing with the rows of the first date among the 2 associated with the participant, or the rows of the second date.

It would look like this :

df = data.frame(date  = c("2021-11-17", "2021-11-17", "2022-05-11", "2022-05-11", "2021-11-04", "2021-11-04", "2022-06-09", "2022-06-09"),
                participant_id = rep(c("sub-x", "sub-y"), each = 4)),
session = c("1", "1", "2", "2", "1", "1", "2", "2"))

NB : this is only a sample of my data. In reality, there are 28 rows per date of session and more than 400 participants.

How can I do that?

Thanks a lot in advance!

CodePudding user response:

You can use data.table::rleid:

library(dplyr)
df %>% 
  group_by(participant_id) %>% 
  mutate(session = data.table::rleid(date))

# A tibble: 8 × 3
# Groups:   participant_id [2]
  date       participant_id session
  <chr>      <chr>            <int>
1 2021-11-17 sub-x                1
2 2021-11-17 sub-x                1
3 2022-05-11 sub-x                2
4 2022-05-11 sub-x                2
5 2021-11-04 sub-y                1
6 2021-11-04 sub-y                1
7 2022-06-09 sub-y                2
8 2022-06-09 sub-y                2

CodePudding user response:

Using dplyr, group by id, then use dense_rank() within each id:

library(dplyr)

df %>%
  group_by(participant_id) %>%
  mutate(session = dense_rank(date)) %>%
  ungroup()
# A tibble: 8 × 3
  date       participant_id session
  <chr>      <chr>            <int>
1 2021-11-17 sub-x                1
2 2021-11-17 sub-x                1
3 2022-05-11 sub-x                2
4 2022-05-11 sub-x                2
5 2021-11-04 sub-y                1
6 2021-11-04 sub-y                1
7 2022-06-09 sub-y                2
8 2022-06-09 sub-y                2
  • Related