I have a data frame which consists of social media post data. The two variables of interest are a variable that contains the caption (post_caption) and a variable that describes the kind of post (post_type). The post_caption variable is a long string variable, and the post_type variable is categorical. I would like to recode post_type based on finding partial string matches within the post_caption variable. Example data below.
post_type <- c("type1", "type2", "type3", "type4")
post_caption <- c("This post is about a dog", "This post is about a cat", "This post is about a walrus", "This post is about space")
I have approached recoding other variables (brands and companies) in this dataset using mutate, case_when, and %in%. Example below.
companies_brands %>%
  mutate(brand_r = case_when(
    brands %in% c("b1prodmod1", "b1prodmod2", "b1prodmod3") ~ "brand1_R",
    brands %in% c("b2prodmod1", "b2prodmod2", "b2prodmod3") ~ "brand2_R",
    brands %in% c("b3prodmod1", "b3prodmod2", "b3prodmod3") ~ "brand3_R",
    brands %in% c("b4prodmod1", "b4prodmod2", "b4prodmod3") ~ "brand4_R",
    TRUE ~ brands))
This worked for the companies and brands variables (both categorical), so I thought I would be able to apply the same approach to the post_caption and post_type variables, but it does not recode any data. Example below.
post_info %>%
  mutate(post_type_r = case_when(
    post_caption %in% c("dog", "cat", "walrus") ~ "animal_post",
    post_caption %in% c("space", "rocks", "trees") ~ "other_post",
    TRUE ~ post_type))
I think the issue may be that the post_caption variable is a long string variable, and my code is looking for exact matches. Do I need to split the post_caption variable to achieve what I want? Thanks in advance for any help!
Answer:
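I would use grepl() instead of %in% because you are trying to match partial strings. %in% checks whether the whole caption is equal to one of the listed values, whereas grepl() checks whether a pattern occurs anywhere inside the caption, so there is no need to split post_caption.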
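To see the difference concretely, here is a minimal base R sketch on a single caption taken from the question (the variable names caption, in_result and grepl_result are just for illustration):

```r
caption <- "This post is about a dog"

# %in% asks: is the whole string equal to one of these values?
in_result <- caption %in% c("dog", "cat", "walrus")

# grepl() asks: does the pattern occur anywhere in the string?
grepl_result <- grepl("dog|cat|walrus", caption)

in_result     # FALSE
grepl_result  # TRUE
```

Applied to the example data with case_when: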
library(dplyr)
post_type <- c("type1", "type2", "type3", "type4")
post_caption <- c("This post is about a dog", "This post is about a cat", "This post is about a walrus", "This post is about space")
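case_when(
  # paste(..., collapse = "|") builds the regex "dog|cat|walrus"
  grepl(paste(c("dog", "cat", "walrus"), collapse = "|"), post_caption) ~ "animal_post",
  grepl(paste(c("space", "rocks", "trees"), collapse = "|"), post_caption) ~ "other_post",
  # captions matching neither pattern become NA
  TRUE ~ NA_character_
)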
#> [1] "animal_post" "animal_post" "animal_post" "other_post"
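If you want this inside your original pipeline, something like the following should work. It is a sketch under a few assumptions: the data frame is called post_info as in the question (rebuilt here from the example vectors), and unmatched rows fall back to the existing post_type, which is what the fallback line in the question was aiming for. The fifth row, the \\b word boundaries, perl = TRUE and ignore.case = TRUE are my own additions, there to show that "cat" no longer matches "category"; drop them if you want plain substring matching.

```r
library(dplyr)

# Example data from the question, plus one hypothetical row ("type5")
# that should match neither group
post_info <- data.frame(
  post_type = c("type1", "type2", "type3", "type4", "type5"),
  post_caption = c("This post is about a dog", "This post is about a cat",
                   "This post is about a walrus", "This post is about space",
                   "This post is about a category"),
  stringsAsFactors = FALSE
)

# One regex per group; \\b restricts matches to whole words
animal_pattern <- paste0("\\b(", paste(c("dog", "cat", "walrus"), collapse = "|"), ")\\b")
other_pattern  <- paste0("\\b(", paste(c("space", "rocks", "trees"), collapse = "|"), ")\\b")

post_info_r <- post_info %>%
  mutate(post_type_r = case_when(
    grepl(animal_pattern, post_caption, ignore.case = TRUE, perl = TRUE) ~ "animal_post",
    grepl(other_pattern, post_caption, ignore.case = TRUE, perl = TRUE) ~ "other_post",
    TRUE ~ post_type  # keep the original type when nothing matches
  ))

post_info_r$post_type_r
#> [1] "animal_post" "animal_post" "animal_post" "other_post"  "type5"
```

Note that case_when() uses the first condition that is TRUE, so a caption mentioning both a dog and space would be coded as animal_post; reorder the conditions if you need a different priority.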