use grepl in each group after group_by

Time:10-06

Imagine a study in which each participant returns every day and is asked whether their favorite food is turkey or pizza. All participants initially prefer pizza, but at some point each one switches to turkey. See df below:

df <- data.frame(participant = c(1,1,1,2,2,2,2),
               food = c("pizza", "turkey", "turkey", "pizza", "pizza", "pizza", "turkey"),
               date = c("2012-01-01", "2012-01-02", "2012-01-03","2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"))

I would like to create two new variables: one giving the first study day on which the participant switched their answer to turkey (they may have answered turkey on later days too, but I only care about the first time), and another giving the date of that switch. The desired result is df2:

df2 <- data.frame(participant = c(1,1,1,2,2,2,2),
                  food = c("pizza", "turkey", "turkey", "pizza", "pizza", "pizza", "turkey"),
                  date = c("2012-01-01", "2012-01-02", "2012-01-03","2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"),
                 study_day = c(2,2,2,4,4,4,4),
                 change_date = c("2012-01-02","2012-01-02","2012-01-02","2012-01-04","2012-01-04","2012-01-04","2012-01-04"))

I tried

df %>% group_by(participant) %>% 
  mutate(study_day = which(grepl("turkey", df$food))[1]) %>%
  mutate(change_date = df$date[which(grepl("turkey", df$food))[1]])

but this results in df3

df3 <- data.frame(participant = c(1,1,1,2,2,2,2),
                  food = c("pizza", "turkey", "turkey", "pizza", "pizza", "pizza", "turkey"),
                  date = c("2012-01-01", "2012-01-02", "2012-01-03","2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04"),
                  study_day = c(2,2,2,2,2,2,2),
                  change_date = c("2012-01-02","2012-01-02","2012-01-02","2012-01-02","2012-01-02","2012-01-02","2012-01-02"))

As you see in df3, the study day and change date reflect grepl searching all of df and not just the group I'm interested in after group_by. I suspect I misunderstand how groups/grepl works.

Does anyone have a solution to this?

Thanks!

CodePudding user response:

Within mutate, grepl("turkey", food) returns a logical vector for the current group. We can use that to select the right date. Below I use an if statement to return NA when a participant has not yet switched to turkey.

Converting the date column to the Date class first makes it easier to do calculations with it:

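library(dplyr)

df %>% 
  group_by(participant) %>% 
  mutate(
    date = as.Date(date),
    # first turkey date within the group, or NA if the participant never switched
    change_date = if (!any(grepl("turkey", food))) as.Date(NA) else {
      min(date[grepl("turkey", food)])
    },
    # position of that date among the participant's rows (assumes rows are in date order);
    # format(change_date, "%d") would give the day of the month as text instead
    study_day = match(change_date, date)
  )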
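
To check the NA branch, here is a minimal sketch of the same idea on made-up data. Participant 3 and the names df_na, is_turkey and res are invented for illustration and are not part of the original question.

```r
library(dplyr)

# Hypothetical data: participant 3 never switches to turkey
df_na <- data.frame(
  participant = c(1, 1, 3, 3),
  food = c("pizza", "turkey", "pizza", "pizza"),
  date = as.Date(c("2012-01-01", "2012-01-02", "2012-01-01", "2012-01-02"))
)

res <- df_na %>%
  group_by(participant) %>%
  mutate(
    is_turkey = grepl("turkey", food),  # logical vector, evaluated per group
    # earliest turkey date in the group, or a Date-typed NA if there is none
    change_date = if (any(is_turkey)) min(date[is_turkey]) else as.Date(NA)
  ) %>%
  ungroup()
```

Using as.Date(NA) rather than a bare NA keeps the column type consistent across groups. Note that indexing with [1] on an empty result also yields NA, which is why the which(...)[1] form needs no explicit branch.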

CodePudding user response:

Someone on Reddit pointed out that dropping df$ from the relevant parts of the code makes it work. Inside group_by and mutate, a bare column name such as food refers to the current group's rows, whereas df$food always pulls the full column from the original, ungrouped data frame.

df %>% group_by(participant) %>% 
  mutate(study_day = which(grepl("turkey", food))[1]) %>%
  mutate(change_date = date[which(grepl("turkey", food))[1]])
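
The difference between the bare name and df$ can be seen directly by comparing vector lengths inside a grouped call. This is only an illustrative sketch; the names sizes, n_bare and n_dollar are made up here.

```r
library(dplyr)

df <- data.frame(
  participant = c(1, 1, 1, 2, 2, 2, 2),
  food = c("pizza", "turkey", "turkey", "pizza", "pizza", "pizza", "turkey")
)

sizes <- df %>%
  group_by(participant) %>%
  summarise(
    n_bare   = length(food),     # rows in the current group only
    n_dollar = length(df$food)   # always the full, ungrouped column
  )
```

Here n_bare is 3 and 4 for the two participants, while n_dollar is 7 for both, so which(grepl("turkey", df$food))[1] finds the same first match in every group.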