I want to extract a pattern of phrases from the following sentences.
text1 <- "On a year-on-year basis, the number of subscribers of Netflix increased 1.15% in November last year."
text2 <- "There is no confirmed audited number of subscribers in the Netflix's earnings report."
text3 <- "Netflix's unaudited number of subscribers has grown more than 1.50% at the last quarter."
The pattern is "number of subscribers", "audited number of subscribers", or "unaudited number of subscribers".
I am using the pattern \\bnumber\\s+of\\s+subscribers?\\b from a previous problem (thanks to @wiktor-stribiżew) to extract the phrases:
library(stringr)
find_words <- function(text){
  pattern <- "\\bnumber\\s+of\\s+subscribers?\\b" # something like this
  str_extract(text, pattern)
}
However, this extracts only the bare phrase "number of subscribers", not the audited or unaudited variants.
Desired output:
find_words(text1)
'number of subscribers'
find_words(text2)
'audited number of subscribers'
find_words(text3)
'unaudited number of subscribers'
CodePudding user response:
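See if this works. It puts the qualifier in an optional group in front of the phrase, so "audited " or "unaudited " is included in the match whenever it is present: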
library(stringr)
find_words <- function(text){
  # optional "audited " / "unaudited " prefix, then the phrase itself
  pattern <- "(audited |unaudited )?number\\s+of\\s+subscribers"
  str_extract(text, pattern)
}
You can test it with the sample texts you provided:
find_words(text1)
# 'number of subscribers'
find_words(text2)
# 'audited number of subscribers'
find_words(text3)
# 'unaudited number of subscribers'
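If you would rather not depend on stringr, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the helper name find_words_base is made up here, the prefix is written more compactly as (?:un)?audited, and the word boundaries and optional plural from the question's pattern are kept. It works on a whole character vector and returns NA where nothing matches.

```r
# Hypothetical helper (not from the original post): base R only, no stringr.
find_words_base <- function(text) {
  # Optional "audited"/"unaudited" prefix, then the phrase; PCRE syntax.
  pattern <- "\\b(?:(?:un)?audited\\s+)?number\\s+of\\s+subscribers?\\b"
  m <- regexpr(pattern, text, perl = TRUE)  # first match per element, -1 if none
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(text))
  out[hit] <- regmatches(text, m)           # regmatches() returns only the hits
  out
}

texts <- c(
  "On a year-on-year basis, the number of subscribers of Netflix increased 1.15% in November last year.",
  "There is no confirmed audited number of subscribers in the Netflix's earnings report.",
  "Netflix's unaudited number of subscribers has grown more than 1.50% at the last quarter.",
  "This sentence does not mention the phrase at all."
)

find_words_base(texts)
# "number of subscribers"  "audited number of subscribers"
# "unaudited number of subscribers"  NA
```

The \\b in front keeps the pattern from matching inside a longer word, and the prefix is picked up because a regex match starts at the leftmost position where the whole pattern can succeed.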