I have the following character vector:
library(stringi)
s=stri_rand_lipsum(10)
The function grepl searches for matches to the argument pattern within every element of a character vector. As far as I know, it searches for only one pattern at a time. For example, to find the entries that contain both "conubia" and "viverra", I have to perform two searches:
x=s[grepl("conubia",s)]
x=x[grepl("viverra",x)]
However, I would like to search for two or more terms that appear in the same entry of s within a window of, e.g., 140 characters.
CodePudding user response:
You can use the *apply family. If your source text is a character vector, I recommend using vapply, but you have to specify the type and the length of the value returned for each pattern. Because you use grepl, the returned values are logical vectors.
txt = "My name is Abdur Rohman"
patt = c("na", "Ab","man", "om")
vapply(patt, function(x) grepl(x,txt),
FUN.VALUE = logical(length(txt)))
# na Ab man om
# TRUE TRUE TRUE FALSE
So, in your example you can use:
s = stri_rand_lipsum(10)
vapply(c("conubia","viverra"), function(x) grepl(x,s),
FUN.VALUE = logical(length(s)))
# conubia viverra
# [1,] TRUE TRUE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] FALSE FALSE
# [5,] FALSE FALSE
# [6,] FALSE TRUE
# [7,] FALSE FALSE
# [8,] FALSE FALSE
# [9,] FALSE FALSE
#[10,] FALSE FALSE
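To keep only the entries that contain every term, the rows of such a matrix can be reduced to a single logical per entry. The following is a minimal sketch on a small made-up vector (the sample strings and the names terms, hits and both are assumptions for illustration, not part of the original post):

```r
# Hypothetical sample data; stri_rand_lipsum() output would work the same way
s <- c("conubia nostra viverra", "only conubia here", "neither term")
terms <- c("conubia", "viverra")

# One column per term, one row per element of s
hits <- vapply(terms, function(p) grepl(p, s),
               FUN.VALUE = logical(length(s)))

# TRUE for rows where every term was found
both <- rowSums(hits) == length(terms)
s[both]
# [1] "conubia nostra viverra"
```

Note that when s has length 1, vapply returns a named vector rather than a matrix, so rowSums would not apply; all(hits) is enough in that case.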
Edit to include a 140-character window
As for the 140-character window explained in your comment, one way to meet the requirement is to extract all characters between the two target strings and then count them. The requirement is met only if that count is less than or equal to 140.
Extracting all characters between two strings can be done with a regular expression in gsub. However, if the strings are repeated, you need to specify which occurrences delimit the window. Let me give an example:
txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"
This text contains two occurrences of conubia and two of viverra, so there are four ways to choose the window of characters between conubia and viverra.
- Option 1: between the last conubia and the first viverra
gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
- Option 2: between the first conubia and the last viverra
gsub(".*?conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
# [1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
- Option 3: between the first conubia and the first viverra
gsub(".*?conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "
- Option 4: between the last conubia and the last viverra
gsub(".*conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "
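The four options differ only in whether each .* is greedy (.*) or lazy (.*?). As a rough illustration, the same four patterns can be tried on a much shorter string, where X and Y stand in for the two terms (the string x and the name res are made up for this sketch):

```r
# Hypothetical short string: X and Y play the roles of conubia and viverra
x <- "a X b X c Y d Y e"

res <- c(
  last_first  = gsub(".*X(.*?)Y.*",  "\\1", x, perl = TRUE),  # Option 1
  first_last  = gsub(".*?X(.*)Y.*",  "\\1", x, perl = TRUE),  # Option 2
  first_first = gsub(".*?X(.*?)Y.*", "\\1", x, perl = TRUE),  # Option 3
  last_last   = gsub(".*X(.*)Y.*",   "\\1", x, perl = TRUE)   # Option 4
)
res
#    last_first    first_last   first_first     last_last
#         " c " " b X c Y d "     " b X c "     " c Y d "
```

A greedy .* before the first term pushes the match to its last occurrence, while a lazy .*? stops at the first one; the same applies to the capture group in front of the second term.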
To count the extracted characters, nchar can be used.
# Option 1
nchar(gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE))
#[1] 68
Applying this approach:
set.seed(8)
s1 <- stri_rand_lipsum(10)
Nch <- nchar(gsub(".*conubia(.*?)viverra.*", "\\1", s1, perl = TRUE))
Nch
# [1] 637 42 512 528 595 640 522 407 388 512
We find that the second element of s1 meets the requirement.
To print the element, we can use s1[which(Nch <= 140)].
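One caveat: gsub returns its input unchanged when the pattern does not match, so nchar then reports the length of the whole string. An entry shorter than 140 characters that lacks one of the terms, or has viverra before conubia, would therefore pass the check by accident. A possible alternative, sketched below under my own assumptions, compares match positions from gregexpr directly. The helper within_window and its arguments are hypothetical names, not an existing function, and the terms are treated as literal strings.

```r
# Hypothetical helper: TRUE when some occurrence of `a` and some occurrence
# of `b` have at most `width` characters between them, in either order.
within_window <- function(x, a, b, width = 140) {
  starts_a <- gregexpr(a, x, fixed = TRUE)
  starts_b <- gregexpr(b, x, fixed = TRUE)
  mapply(function(pa, pb) {
    pa <- as.integer(pa)  # drop the match attributes
    pb <- as.integer(pb)
    if (pa[1] == -1L || pb[1] == -1L) return(FALSE)  # a term is missing
    # Characters between the end of the earlier term and the start of the later one
    gap_ab <- as.vector(outer(pb, pa + nchar(a), "-"))  # a comes first
    gap_ba <- as.vector(outer(pa, pb + nchar(b), "-"))  # b comes first
    gaps <- c(gap_ab[gap_ab >= 0], gap_ba[gap_ba >= 0])
    length(gaps) > 0 && min(gaps) <= width
  }, starts_a, starts_b)
}

txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"
within_window(c(txt, "viverra x conubia", "conubia only"), "conubia", "viverra")
# [1]  TRUE  TRUE FALSE
```

For the example text the smallest gap is the 68 characters found in Option 1, so within_window(txt, "conubia", "viverra", width = 10) returns FALSE.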