How can I find the first location of specific words in a dataframe cell, and save the output in a new column in the same dataframe?
Ideally I want the first match for each of the words in the dictionary.
df <- data.frame(text = c("omg coke is so awsme","i always preferred pepsi", "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")
The location can be given in characters or in words, counted from the start of the string.
I've been playing around with the code found here, but I can't make it work.
For example, this code does the job, but only for one word, and for one string (rather than a df and dictionary)
my_string = "omg coke is so awsme"
unlist(gregexpr("coke", my_string))[1]
Desired output:
                                                       text location
1                                      omg coke is so awsme        2
2                                  i always preferred pepsi        4
3 mozart is so overrated by yeah fanta makes my day, always        7
Like I said, the location can also be counted in characters rather than words, if that is easier.
CodePudding user response:
Here's a simple for loop:
for(i in dict) {
  df[[i]] = stringi::stri_locate_first_fixed(df$text, i)[, 1]
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    5    NA    NA
# 2                                  i always preferred pepsi   NA    20    NA
# 3 mozart is so overrated by yeah fanta makes my day, always   NA    NA    32
Or with regexpr (part of base, so no dependencies), which returns -1 rather than NA when there is no match:
for(i in dict) {
  df[[i]] = regexpr(i, df$text, fixed = TRUE)
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    5    -1    -1
# 2                                  i always preferred pepsi   -1    20    -1
# 3 mozart is so overrated by yeah fanta makes my day, always   -1    -1    32
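If the -1 values get in the way, one possible variation on the base version is to build all the columns at once and recode -1 to NA, so the result lines up with the stringi output. This is only a sketch and not part of the original answer; the names pos and df_pos are made up for illustration.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# One column of character positions per dictionary word;
# as.integer() drops the attributes that regexpr() attaches
pos <- sapply(dict, function(w) as.integer(regexpr(w, df$text, fixed = TRUE)))

# regexpr() signals "no match" with -1, so recode that to NA
pos[pos == -1L] <- NA

# pos and df_pos are hypothetical names used only in this sketch
df_pos <- cbind(df, pos)
df_pos
```

Working on a separate matrix also leaves the original df untouched until you decide to bind the columns on.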
And here's a solution for word number, though I would recommend deleting all the punctuation before using this:
df$words = strsplit(df$text, split = " ")
for(i in dict) {
  df[[i]] = sapply(df$words, \(x) match(i, unlist(x)))
}
df
#                                                        text coke pepsi fanta
# 1                                      omg coke is so awsme    2    NA    NA
# 2                                  i always preferred pepsi   NA     4    NA
# 3 mozart is so overrated by yeah fanta makes my day, always   NA    NA     7
#                                                                 words
# 1                                            omg, coke, is, so, awsme
# 2                                         i, always, preferred, pepsi
# 3 mozart, is, so, overrated, by, yeah, fanta, makes, my, day,, always
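The punctuation caveat matters because "day," in the last row is stored with its comma, so a dictionary word followed by punctuation would not be found by match(). A minimal sketch of the cleanup step, assuming it is fine to drop every punctuation character; clean, words and word_pos are illustrative names, not from the answer above.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# Strip punctuation first, then split on runs of whitespace
clean <- gsub("[[:punct:]]", "", df$text)
words <- strsplit(clean, "\\s+")

# match() gives the index of the first occurrence of the word, or NA if absent
word_pos <- sapply(dict, function(w) {
  vapply(words, function(x) match(w, x), integer(1))
})
word_pos
```

The result is a matrix with one row per text and one column per dictionary word, which can be attached to df with cbind() if needed.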
CodePudding user response:
Just run
c(regexpr(paste0(dict,collapse = '|'), df$text))
[1] 5 20 32
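Note that the combined pattern gives one position per row (whichever dictionary word comes first), not one per word. If you also need to know which word was hit, something along these lines should work in base R. It is a sketch only; hit, word and first_hit are names invented for the example.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# First match of any dictionary word in each row
m <- regexpr(paste(dict, collapse = "|"), df$text)
hit <- as.integer(m) > 0

# regmatches() returns only the rows that matched, so fill the rest with NA
word <- rep(NA_character_, nrow(df))
word[hit] <- regmatches(df$text, m)

first_hit <- data.frame(location = as.integer(m), word = word)
first_hit
```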
Edit:
If you want the position counted in words:
library(tidyverse)
pat <- sprintf(".*(%s)", paste0(dict,collapse = '|'))
df %>%
  mutate(loc = str_count(str_extract(text, pat), "\\w+"))
                                                       text loc
1                                      omg coke is so awsme   2
2                                  i always preferred pepsi   4
3 mozart is so overrated by yeah fanta makes my day, always   7
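One thing to watch: the leading .* in that pattern is greedy, so when a text contains several dictionary words it runs up to the last one rather than the first. The sample data has one dictionary word per row, so the result is the same there. A rough base R sketch that counts words up to the end of the first match instead; the extra test string and the names x, prefix and loc are assumptions added for illustration.

```r
df <- data.frame(text = c("omg coke is so awsme",
                          "i always preferred pepsi",
                          "mozart is so overrated by yeah fanta makes my day, always"))
dict <- c("coke", "pepsi", "fanta")

# Hypothetical extra row with more than one dictionary word
x <- c(df$text, "coke or pepsi, then coke")

# Position and length of the first match of any dictionary word
m <- regexpr(paste(dict, collapse = "|"), x)
end <- as.integer(m) + attr(m, "match.length") - 1L

# Keep the text up to the end of the first match and count the words in it
prefix <- substr(x, 1, end)
loc <- lengths(regmatches(prefix, gregexpr("\\w+", prefix)))

# Rows without any dictionary word get NA
loc[as.integer(m) < 0] <- NA
loc
```

For the extra row this counts one word, since "coke" is the first word of the string.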