R - Regular Expression to Extract Text Between Parentheses That Contain a Keyword

I need to extract the text from between parentheses if a keyword is inside the parentheses.

So if I have a string that looks like this:

('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')

And my keyword is "LOC", I just want to extract ('Latin America', 'LOC'), not the others.

Help is appreciated!!

This is a sample of my data set, a csv file:

,speech_id,sentence,date,speaker,file,parsed_text,named_entities
0,950094636,Let me state that the one sure way we can make it easy for Castro to continue to gain converts in Latin America is if we continue to support regimes of the ilk of the Somoza family,19770623,Mr. OBEY,06231977.txt,Let me state that the one sure way we can make it easy for Castro to continue to gain converts in Latin America is if we continue to support regimes of the ilk of the Somoza family,"[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]"
1,950094636,That is how we encourage the growth of communism,19770623,Mr. OBEY,06231977.txt,That is how we encourage the growth of communism,[]
2,950094636,That is how we discourage the growth of democracy in Latin America,19770623,Mr. OBEY,06231977.txt,That is how we discourage the growth of democracy in Latin America,"[('Latin America', 'LOC')]"
3,950094636,Mr Chairman,19770623,Mr. OBEY,06231977.txt,Mr Chairman,[]
4,950094636,given the speeches I have made lately about the press,19770623,Mr. OBEY,06231977.txt,given the speeches I have made lately about the press,[]
5,950094636,I am not one,19770623,Mr. OBEY,06231977.txt,I am not one,[]
6,950094636,I suppose,19770623,Mr. OBEY,06231977.txt,I suppose,[]

I am trying to extract just the parenthesized groups that contain the word LOC:

regex <- "(?=\\().*? \'LOC.*?(?<=\\))"
  
  
filtered_df$clean_NE <- str_extract_all(filtered_df$named_entities, regex)

The above regular expression does not work. Thanks!

CodePudding user response:

You can use

str_extract_all(filtered_df$named_entities, "\\([^()]*'LOC'[^()]*\\)")

Details:

\( - a ( char
[^()]* - zero or more chars other than ( and )
'LOC' - a 'LOC' string
[^()]* - zero or more chars other than ( and )
\) - a ) char.
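
The negated class [^()]* is what keeps each match inside a single pair of parentheses. Here is a minimal sketch of the difference. It uses base R's regmatches() and gregexpr() with perl = TRUE instead of stringr, only so that it runs without extra packages. The lazy pattern is an illustrative variant, not the exact regex from the question:

```r
x <- "[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]"

# Illustrative lazy variant: the match starts at the FIRST "(" in the string,
# and .*? is allowed to run across ")" and "(" until it reaches 'LOC'.
lazy <- regmatches(x, gregexpr("\\(.*?'LOC'.*?\\)", x, perl = TRUE))[[1]]

# Negated class: no "(" or ")" may appear between the outer parentheses,
# so the match cannot spill over from an earlier tuple.
bounded <- regmatches(x, gregexpr("\\([^()]*'LOC'[^()]*\\)", x, perl = TRUE))[[1]]

print(lazy)
# [1] "('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC')"
print(bounded)
# [1] "('Latin America', 'LOC')"
```

A regex engine takes the leftmost possible match first, so laziness alone only shortens the right-hand end of the match, not the left.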

Here is a short R demo:

library(stringr)
x <- "[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]"
str_extract_all(x, "\\([^()]*'LOC'[^()]*\\)")
# => [1] "('Latin America', 'LOC')"
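
For a whole column, the result has one element per row, and rows such as [] yield an empty match. A possible sketch follows, assuming a small stand-in for filtered_df built from three rows of the sample data. The data frame contents and the choice to collapse matches into a single string are assumptions for illustration, and base R is used in place of stringr so that the block is self-contained:

```r
# Hypothetical stand-in for filtered_df, using three rows from the sample data
filtered_df <- data.frame(
  named_entities = c(
    "[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]",
    "[]",
    "[('Latin America', 'LOC')]"
  ),
  stringsAsFactors = FALSE
)

pattern <- "\\([^()]*'LOC'[^()]*\\)"

# One character vector of matches per row; rows without a match give character(0)
matches <- regmatches(
  filtered_df$named_entities,
  gregexpr(pattern, filtered_df$named_entities, perl = TRUE)
)

# Collapse each row's matches into one string so the new column is a plain
# character vector; rows with no match become ""
filtered_df$clean_NE <- vapply(matches, paste, character(1), collapse = ", ")

print(filtered_df$clean_NE)
# [1] "('Latin America', 'LOC')" ""                         "('Latin America', 'LOC')"
```

Keeping the list of matches as a list column is also possible. Collapsing is just one convenient option when the column later needs to be written back to CSV.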

As a bonus, to get just Latin America without the surrounding parentheses and tag, you can use

str_extract_all(x, "[^']+(?=',\\s*'LOC'\\))")
# => [1] "Latin America"

Here, [^']+(?=',\s*'LOC'\)) matches one or more chars other than ' that are followed by a ', sequence, zero or more whitespace chars, and then the 'LOC') string.
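
When a row holds several LOC entities, the lookahead version returns each name separately. Here is a small sketch on a hypothetical input string (x2 is made up for illustration and is not from the data set), again using base R with perl = TRUE, which is needed there for the lookahead. It uses the pattern [^']+(?=',\s*'LOC'\)) and so assumes the entity text itself contains no single quote:

```r
# Hypothetical input with two LOC entities and one PERSON entity
x2 <- "[('Latin America', 'LOC'), ('Castro', 'PERSON'), ('Europe', 'LOC')]"

# [^']+ grabs the text inside the first pair of quotes; the lookahead only
# succeeds when that text is immediately followed by ', 'LOC')
locs <- regmatches(x2, gregexpr("[^']+(?=',\\s*'LOC'\\))", x2, perl = TRUE))[[1]]

print(locs)
# [1] "Latin America" "Europe"
```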