I have a data frame in R. I want to keep a row if "woman" is
- the first word in a sentence, or
- the second word in a sentence, or
- the third word in a sentence and preceded by the word "no", "not", or "never".
phrases_with_woman <- structure(list(phrase = c("woman get degree", "woman obtain justice",
"session woman vote for member", "woman have to end", "woman have no existence",
"woman lose right", "woman be much", "woman mix at dance", "woman vote as member",
"woman have power", "woman act only", "she be woman", "no committee woman passed vote")), row.names = c(NA,
-13L), class = "data.frame")
In the above example, I want to match all rows except "she be woman".
This is my code so far. I have a positive lookbehind, (?<=woman\\s)\\w+, that seems to be on the right track, but it matches too many preceding words. I tried using {1} to match just one preceding word, but this syntax didn't work.
library(dplyr)
library(stringr)
matches <- phrases_with_woman %>%
  filter(str_detect(phrase, "^woman|(?<=woman\\s)\\w+"))
Help is appreciated.
CodePudding user response:
Each condition can be one alternative in the pattern, although the last one needs two alternatives, assuming that no/not/never can be either the first or the second word.
library(dplyr)
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"
phrases_with_woman %>%
filter(grepl(pat, phrase))
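To see which rows an anchored alternation like this keeps, here is a small self-contained check in base R. It is only a sketch: it assumes each filler word is written as \\w+ (one or more word characters), and the vector phrases is a hypothetical subset of the question's data plus one made-up phrase.

```r
# Hypothetical sample: a few phrases from the question plus one made-up case
phrases <- c("woman get degree", "session woman vote for member",
             "she be woman", "no committee woman passed vote",
             "she never woman be")

# Alternatives: word 1 | word 2 | word 3 after a negation in word 2 or word 1
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"

# Keep only the phrases that match the anchored pattern
kept <- phrases[grepl(pat, phrases)]
print(kept)
```

Because the pattern is anchored with ^, "woman" can only match in the first three word positions, so "she be woman" is dropped while both negated phrases are kept.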
CodePudding user response:
I haven't come up with a regex solution, but here is a workaround.
library(dplyr)
library(stringr)
phrases_with_woman %>%
  # keep rows where "woman" is word 1 or 2, or is word 3 with a negation in words 1-2
  filter(str_detect(word(phrase, 1, 2), "\\bwoman\\b") |
           (word(phrase, 3) == "woman" & str_detect(word(phrase, 1, 2), "\\b(no|not|never)\\b")))
# phrase
# 1 woman get degree
# 2 woman obtain justice
# 3 session woman vote for member
# 4 woman have to end
# 5 woman have no existence
# 6 woman lose right
# 7 woman be much
# 8 woman mix at dance
# 9 woman vote as member
# 10 woman have power
# 11 woman act only
# 12 no committee woman passed vote
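The same word-position logic can also be written without stringr by splitting each phrase on whitespace. This is a minimal sketch in base R; keep_phrase is a hypothetical helper name, not a function from any package, and the sample vector is a subset of the question's data.

```r
# Hypothetical helper: TRUE when "woman" is word 1 or 2, or is word 3
# and one of the first two words is "no", "not", or "never".
keep_phrase <- function(phrase) {
  words <- strsplit(phrase, "\\s+")[[1]]
  if ("woman" %in% head(words, 2)) return(TRUE)
  length(words) >= 3 &&
    words[3] == "woman" &&
    any(head(words, 2) %in% c("no", "not", "never"))
}

phrases <- c("woman get degree", "session woman vote for member",
             "she be woman", "no committee woman passed vote")

# Apply the helper to every phrase and keep the matching ones
keep <- vapply(phrases, keep_phrase, logical(1), USE.NAMES = FALSE)
print(phrases[keep])
```

With dplyr, the helper could be used as filter(vapply(phrase, keep_phrase, logical(1))) on the question's data frame.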