In R, how to find the location of a word in a string?

How can I find the first location of specific words in a dataframe cell, and save the output in a new column in the same dataframe?

Ideally I want the first match for each of the words in the dictionary.

df <- data.frame(text = c("omg coke is so awsme","i always preferred pepsi", "mozart is so overrated by yeah fanta makes my day, always"))

dict <- c("coke", "pepsi", "fanta")

The location can be counted in either characters or words.

I've been playing around with the code found here, but I can't make it work.

For example, this code does the job, but only for one word and one string (rather than a data frame and a dictionary):

my_string = "omg coke is so awsme"
unlist(gregexpr("coke", my_string))[1]

Desired output:

                                                       text  location
1                                      omg coke is so awsme         2
2                                  i always preferred pepsi         4
3 mozart is so overrated by yeah fanta makes my day, always         7

As I said, the location can also be a character position rather than a word position, if that is easier.

CodePudding user response:

Here's a simple for loop:

for(i in dict) {
  df[[i]] = stringi::stri_locate_first_fixed(df$text, i)[, 1]
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    5    NA    NA
# 2                                  i always preferred pepsi   NA    20    NA
# 3 mozart is so overrated by yeah fanta makes my day, always   NA    NA    32

Or with regexpr (part of base R, so no dependencies), which returns -1 instead of NA where there is no match:

for(i in dict) {
  df[[i]] = regexpr(i, df$text, fixed = TRUE)
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    5    -1    -1
# 2                                  i always preferred pepsi   -1    20    -1
# 3 mozart is so overrated by yeah fanta makes my day, always   -1    -1    32
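
If the -1 sentinel is inconvenient, one possible follow-up is to recode it as NA so the base R result lines up with the stringi one. This is only a sketch: it rebuilds df and dict from the question so that it runs on its own, and the intermediate name pos is just an illustrative choice.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

for (i in dict) {
  # as.integer() drops the extra attributes that regexpr() attaches
  pos <- as.integer(regexpr(i, df$text, fixed = TRUE))
  # regexpr() reports "no match" as -1; recode that as NA
  pos[pos == -1L] <- NA_integer_
  df[[i]] <- pos
}
df
```

Having NA rather than -1 also means that functions such as is.na() or min(..., na.rm = TRUE) behave as expected on the new columns.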

And here's a solution for the word number, though I would recommend deleting all the punctuation before using this, since a token such as "day," will not match "day":

df$words = strsplit(df$text, split = " ")
for(i in dict) {
  df[[i]] = sapply(df$words, \(x) match(i, unlist(x)))
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    2    NA    NA
# 2                                  i always preferred pepsi   NA     4    NA
# 3 mozart is so overrated by yeah fanta makes my day, always   NA    NA     7
#                                                                 words
# 1                                            omg, coke, is, so, awsme
# 2                                         i, always, preferred, pepsi
# 3 mozart, is, so, overrated, by, yeah, fanta, makes, my, day,, always
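
A minimal sketch of the punctuation clean-up recommended above, assuming that dropping every character in the POSIX [[:punct:]] class is acceptable for your text. The names clean and words are illustrative, and df and dict are rebuilt from the question so the snippet runs on its own.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# Remove punctuation, then split each string on runs of whitespace
clean <- gsub("[[:punct:]]", "", df$text)
words <- strsplit(trimws(clean), "\\s+")

for (i in dict) {
  # match() gives the index of the first token equal to i, or NA if absent
  df[[i]] <- sapply(words, function(w) match(i, w))
}
df
```

With this, the token "day," in the third row becomes "day", so a dictionary word followed by a comma or full stop would still be found.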

CodePudding user response:

Just run this, which gives the character position of the first dictionary word found in each string:

c(regexpr(paste0(dict,collapse = '|'), df$text))

[1]  5 20 32

Edit:

If you want the position in words instead:

library(tidyverse)
pat <- sprintf(".*?(%s)", paste0(dict, collapse = "|"))  # lazy .*? stops at the first dictionary word
df %>%
  mutate(loc = str_count(str_extract(text, pat), "\\w+"))

                                                       text loc
1                                      omg coke is so awsme   2
2                                  i always preferred pepsi   4
3 mozart is so overrated by yeah fanta makes my day, always   7
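
For anyone who would rather avoid the tidyverse dependency, the same idea can be sketched in base R: find the first dictionary hit, keep the text up to the end of that hit, and count the words in it. This is a sketch under the assumption that a "word" is a run of \w characters. The names hit, start, end, prefix and n_words are illustrative, and df and dict are rebuilt from the question so the snippet runs on its own.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

pat <- paste(dict, collapse = "|")

hit   <- regexpr(pat, df$text)                  # first dictionary hit per string
start <- as.integer(hit)                        # character position, -1 if none
end   <- start + attr(hit, "match.length") - 1L # last character of the hit

# Text up to and including the matched word, then count the words in it
prefix  <- substr(df$text, 1, end)
n_words <- lengths(regmatches(prefix, gregexpr("\\w+", prefix)))

# Rows without any dictionary word get NA
df$location <- ifelse(start == -1L, NA_integer_, n_words)
df
```

On the example data this gives location values of 2, 4 and 7, matching the desired output in the question.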