Using mutate, case_when, %in% to recode partial string matches within character variables containing

Time:10-15

I have a data frame which consists of social media post data. The two variables of interest are a variable that contains the caption (post_caption) and a variable that describes the kind of post (post_type). The post_caption variable is a long string variable, and the post_type variable is categorical. I would like to recode post_type based on finding partial string matches within the post_caption variable. Example data below.

post_type <- c("type1", "type2", "type3", "type4")
post_caption <- c("This post is about a dog", "This post is about a cat", "This post is about a walrus", "This post is about space")

I have approached recoding other variables (brands and companies) in this dataset using mutate, case_when, and %in%. Example below.

companies_brands %>%
  mutate(brand_r = case_when(brands %in% c("b1prodmod1", "b1prodmod2", "b1prodmod3") ~ "brand1_R",
                             brands %in% c("b2prodmod1", "b2prodmod2", "b2prodmod3") ~ "brand2_R",
                             brands %in% c("b3prodmod1", "b3prodmod2", "b3prodmod3") ~ "brand3_R",
                             brands %in% c("b4prodmod1", "b4prodmod2", "b4prodmod3") ~ "brand4_R",
                             TRUE ~ brands))

This worked for the companies and brands variables (both categorical), so I thought I would be able to apply the same approach to the post_caption and post_type variables, but it is not recoding any data. Example below.

post_info %>%
  mutate(post_type_r = case_when(
    post_caption %in% c("dog", "cat", "walrus") ~ "animal_post",
    post_caption %in% c("space", "rocks", "trees") ~ "other_post",
    TRUE ~ post_type))

I think the issue may be that the post_caption variable is a long string variable, and my code is looking for exact matches. Do I need to split the post_caption variable to achieve what I want? Thanks in advance for any help!

CodePudding user response:

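I would use grepl instead of %in%, because you are trying to match part of a string. %in% tests whether each whole caption is exactly equal to one of the listed values, so a caption like "This post is about a dog" never equals "dog". grepl instead searches inside each caption for a regular expression, so you do not need to split post_caption.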

library(dplyr)

post_type <- c("type1", "type2", "type3", "type4")
post_caption <- c("This post is about a dog", "This post is about a cat", "This post is about a walrus", "This post is about space")

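case_when(
  # collapse = "|" builds the regex "dog|cat|walrus", which matches any one keyword
  grepl(paste(c("dog", "cat", "walrus"), collapse = "|"), post_caption) ~ "animal_post",
  grepl(paste(c("space", "rocks", "trees"), collapse = "|"), post_caption) ~ "other_post",
  # captions with no keyword become NA; use post_type here to keep the original value
  TRUE ~ NA_character_
)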
#> [1] "animal_post" "animal_post" "animal_post" "other_post"
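
To apply the same idea inside mutate, as in the question, something like the sketch below should work. It assumes the two example vectors are columns of a data frame called post_info (the name used in the question). The as_pattern helper is hypothetical and not part of dplyr or base R. It wraps the keywords in \\b word boundaries so that "cat" would not also match "category". The ignore.case = TRUE argument is an optional extra that also catches capitalised keywords.

```r
library(dplyr)

# Example data from the question, assembled into a data frame.
post_info <- data.frame(
  post_type = c("type1", "type2", "type3", "type4"),
  post_caption = c("This post is about a dog", "This post is about a cat",
                   "This post is about a walrus", "This post is about space"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join keywords into one regex alternation and wrap it
# in word boundaries, e.g. "\\b(dog|cat|walrus)\\b".
as_pattern <- function(words) {
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

animal_pattern <- as_pattern(c("dog", "cat", "walrus"))
other_pattern  <- as_pattern(c("space", "rocks", "trees"))

post_info_r <- post_info %>%
  mutate(post_type_r = case_when(
    grepl(animal_pattern, post_caption, ignore.case = TRUE) ~ "animal_post",
    grepl(other_pattern, post_caption, ignore.case = TRUE) ~ "other_post",
    TRUE ~ post_type  # keep the original type when no keyword matches
  ))

post_info_r$post_type_r
#> [1] "animal_post" "animal_post" "animal_post" "other_post"
```

Because case_when returns the first condition that matches, a caption containing keywords from both lists would be coded as "animal_post". Reorder the conditions if you want the other priority.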