Combining mutate and text search in R

Time:12-05

I'm trying to create new columns that indicate whether a given word appears within a larger sentence, so that I can filter for all participants who used the word "slow" or "happy" in a comment section.

As a simple example...

fruit_list = data.frame(c("apple", "bannana", "cherries"))

person <- seq(1:3)
comment <- c("I like apple pie", "The worst flavor is bannana", "Those cherries are cheap")
df <- data.frame(person,comment)

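# i indexes the search terms, j indexes the comments (rows of df)
for (i in 1:nrow(fruit_list)){
  for (j in 1:nrow(df)){
    # columns 1-2 are person and comment, so term i goes in column i + 2
    df[j, i + 2] <- as.integer(grepl(fruit_list[i, ], df[j, 2], fixed = TRUE))
  }
}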

The above code works in that it correctly places a 0 or 1, but as I scaled it up to larger lists of words to search for (i.e. fruits) and more sentences (i.e. comments), it (not surprisingly) gets very slow.

I was wondering if using a mutate approach would be more efficient, doing something like this

df <- df %>% 
  mutate(
  apple = as.integer(grepl(fruit_list[1,], comment, fixed =TRUE))
  ) %>% 
  mutate(
  bannana = as.integer(grepl(fruit_list[2,], comment, fixed =TRUE))
  ) %>% 
  mutate(
  cherries = as.integer(grepl(fruit_list[3,], comment, fixed =TRUE))
  )

The only problem here is I can't figure out how to cycle through the items in the list of fruit so I had to directly code the names. For three items that is manageable, but I want this to scale up to dozens of search terms. I've used simple sapply and lapply before, but I'm not sure how to use them in this more complicated scenario.

I'm betting there is some way to pull the name of the mutated column from the list and use that same value as the search term for the grepl, but I just can't piece it together.

Any suggestions would be appreciated, even ones that go in entirely different directions from what I've tried.

CodePudding user response:

You could use

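library(tidyverse)

# fruit_list as a plain character vector (see the note below)
fruit_list <- c("apple", "bannana", "cherries")

df %>% 
  # collapse the terms into one alternation so every comment is checked
  # against every term, not only the term in the same position
  mutate(names = str_extract(comment, str_c(fruit_list, collapse = "|")),
         values = 1) %>% 
  pivot_wider(names_from = names,
              values_from = values,
              values_fill = 0)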

This returns

# A tibble: 3 x 5
  person comment                     apple bannana cherries
   <int> <chr>                       <dbl>   <dbl>    <dbl>
1      1 I like apple pie                1       0        0
2      2 The worst flavor is bannana     0       1        0
3      3 Those cherries are cheap        0       0        1

Note: I changed your fruit_list from data.frame into a vector (for simplification). In your example, you could use fruit_list[,1] instead of vector fruit_list.
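Because str_extract keeps only one match per comment, a comment that mentions two terms would be flagged for just one of them. A minimal base R sketch that flags every term independently is shown here; the fourth comment is a hypothetical row added only to illustrate the multi-term case, and the names flags and result are placeholders.

```r
fruit_list <- c("apple", "bannana", "cherries")

df <- data.frame(
  person  = 1:4,
  comment = c("I like apple pie",
              "The worst flavor is bannana",
              "Those cherries are cheap",
              "apple and cherries together")  # hypothetical extra row
)

# grepl() is vectorised over the comments, so one call per term is enough.
# Naming the input with setNames() makes lapply() return a named list,
# which as.data.frame() turns into one column per search term.
flags <- lapply(setNames(fruit_list, fruit_list), function(term) {
  as.integer(grepl(term, df$comment, fixed = TRUE))
})

result <- cbind(df, as.data.frame(flags))
print(result)
```

This runs one grepl call per search term rather than one per term and comment, so it should also scale better than the nested loop in the question.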

CodePudding user response:

We can use dplyr and purrr::map_dfc. It is easier to use a vector of fruits, instead of a data.frame:

library(purrr)
library(dplyr)

fruits<-c("apple", "bannana", "cherries")

map_dfc(fruits, ~ as.integer(grepl(.x, df$comment, fixed = TRUE))) %>%
    set_names(fruits) %>%
    bind_cols(df, .)

Or just with:

df %>% mutate(set_names(map_dfc(fruits, ~ as.integer(grepl(.x, comment, fixed = TRUE))), fruits))

Output:

  person                     comment apple bannana cherries
1      1            I like apple pie     1       0        0
2      2 The worst flavor is bannana     0       1        0
3      3    Those cherries are cheap     0       0        1
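One caveat for the real use case ("slow", "happy"): plain substring matching also flags "slowly" and misses "Happy". A hedged sketch of whole-word, case-insensitive matching in base R follows. It assumes the search terms are plain alphanumeric words (regex metacharacters would need escaping), and the comments, terms, and the names survey and flagged are made up for illustration.

```r
terms <- c("slow", "happy")

# hypothetical survey comments
survey <- data.frame(
  person  = 1:3,
  comment = c("He was slow", "Slowly improving", "Happy with it")
)

# \\b marks a word boundary, so "slow" does not match inside "Slowly";
# ignore.case = TRUE lets "happy" match "Happy".
flags <- lapply(setNames(terms, terms), function(term) {
  pattern <- paste0("\\b", term, "\\b")
  as.integer(grepl(pattern, survey$comment, ignore.case = TRUE, perl = TRUE))
})

flagged <- cbind(survey, as.data.frame(flags))
print(flagged)

# keep only participants who used at least one of the terms
print(flagged[rowSums(flagged[terms]) > 0, ])
```

Note that fixed = TRUE cannot be combined with word boundaries, since it turns off regular-expression matching.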