Extract sentences in a paragraph in R

Time:09-27

I have a paragraph such as the one below:

y = "I have been working with ABC CORPORATION nearly about 4 years now. And today remark my last day working at this company. I am proudly announce that I will joining XYZ SDN BHD starting this Monday."

I need to extract only the company names from this paragraph, so that the output looks like this:

"ABC CORPORATION", "XYZ SDN BHD"

Is there a way to do this in R? I'm not yet familiar with text analysis in R.

Would a dplyr split or grep be the better approach?

CodePudding user response:

Using stringr's str_extract_all:

y = "I have been working with ABC CORPORATION nearly about 4 years now. And today remark my last day working at this company. I am proudly announce that I will joining XYZ SDN BHD starting this Monday."

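# Extract every run of two or more uppercase letters and/or whitespace characters
uppercase_words <- unlist(stringr::str_extract_all(y, pattern = '([:upper:]|[:space:]){2,}'))
# Drop matches that contain only one letter once blanks are removed (e.g. "I")
uppercase_words <- uppercase_words[nchar(gsub('[[:blank:]]', '', uppercase_words)) != 1]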

uppercase_words

Output:

[1] " ABC CORPORATION " " XYZ SDN BHD "
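
The matches above still carry the spaces on either side of each name. A minimal sketch of one way to clean them up, assuming stringr is installed: it uses the equivalent bracketed class [[:upper:][:space:]] in place of the alternation and str_squish() to trim the results. The variable names raw, cleaned and companies are illustrative only.

```r
library(stringr)

y <- "I have been working with ABC CORPORATION nearly about 4 years now. And today remark my last day working at this company. I am proudly announce that I will joining XYZ SDN BHD starting this Monday."

# Runs of two or more uppercase letters and/or whitespace characters
raw <- unlist(str_extract_all(y, "[[:upper:][:space:]]{2,}"))

# str_squish() trims both ends and collapses internal runs of whitespace
cleaned <- str_squish(raw)

# Drop single-letter leftovers such as "I" or the "A" of "And"
companies <- cleaned[nchar(cleaned) > 1]
companies
```

This only works because the company names in the sample are written entirely in capitals; a name in mixed case would be missed.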

CodePudding user response:

We could use str_extract_all with a regex pattern:

library(stringr)
str_extract_all(y, "[A-Z][\\w-]*(\\s+[A-Z][\\w-]*)+")

output:

[1] "ABC CORPORATION" "XYZ SDN BHD" 
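
On the grep part of the question: grep() only reports which elements contain a match, so it cannot return the matched text by itself. A rough base-R sketch, with no extra packages, is to pair gregexpr() with regmatches(). The regex below looks for a capitalised word followed by one or more further capitalised words; the names pattern and companies are illustrative only.

```r
y <- "I have been working with ABC CORPORATION nearly about 4 years now. And today remark my last day working at this company. I am proudly announce that I will joining XYZ SDN BHD starting this Monday."

# A capitalised word followed by one or more further capitalised words
pattern <- "[A-Z][\\w-]*(\\s+[A-Z][\\w-]*)+"

# gregexpr() locates every match; regmatches() extracts the matched text.
# perl = TRUE is set so that \\w and \\s are handled by the PCRE engine.
companies <- regmatches(y, gregexpr(pattern, y, perl = TRUE))[[1]]
companies
```

This is a heuristic: any run of two or more capitalised words (a place name, for instance) would be picked up as well, so the result may need filtering on real text.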