Search for matches to argument pattern within every item of a character vector and a window function

I have the following character vector:

library(stringi)
s=stri_rand_lipsum(10)

The function grepl searches for matches to a pattern within each element of a character vector. As far as I know, it searches for only one term at a time. For example, to find the entries that contain both "conubia" and "viverra", I have to perform two searches:

x=s[grepl("conubia",s)]
x=x[grepl("viverra",x)]

However, I would like to search for two or more terms that appear in the same entry of s within a window of a given length, e.g. 140 characters.

CodePudding user response：

You can use the *apply family. Since your source text is a character vector, I recommend vapply, which requires you to specify the type and length of the value returned for each pattern. Because grepl returns one logical value per element of the text, FUN.VALUE is a logical vector with the same length as the text.

txt = "My name is Abdur Rohman"
patt = c("na", "Ab","man", "om")

vapply(patt, function(x) grepl(x,txt), 
       FUN.VALUE = logical(length(txt)))
# na    Ab   man    om 
# TRUE  TRUE  TRUE FALSE

So, in your example you can use:

s = stri_rand_lipsum(10)
vapply(c("conubia","viverra"), function(x) grepl(x,s), 
       FUN.VALUE = logical(length(s)))
#      conubia viverra
# [1,]    TRUE    TRUE
# [2,]   FALSE   FALSE
# [3,]    TRUE   FALSE
# [4,]   FALSE   FALSE
# [5,]   FALSE   FALSE
# [6,]   FALSE    TRUE
# [7,]   FALSE   FALSE
# [8,]   FALSE   FALSE
# [9,]   FALSE   FALSE
#[10,]   FALSE   FALSE
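
To go from per-term results to the entries that contain all the terms, the logical vectors can be combined with a logical AND. The following is only a minimal sketch: s_demo and terms are made-up example data (not the random lipsum text above), and fixed = TRUE is an assumption that the terms are literal strings rather than regular expressions.

```r
# Hypothetical example data: three short strings instead of random lipsum
s_demo <- c("conubia nostra viverra", "only conubia here", "neither term")
terms  <- c("conubia", "viverra")

# One logical vector per term, then AND them together element-wise
per_term <- lapply(terms, function(p) grepl(p, s_demo, fixed = TRUE))
has_all  <- Reduce(`&`, per_term)

# Entries that contain every term
s_demo[has_all]
```

Using Reduce on a list keeps the result a plain logical vector even when the text has length one, where vapply would return a vector rather than a matrix.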

Edit to include a 140-character window

As for the requirement of a 140-character window, as explained in your comment, one way to meet it is to extract all characters between the two target strings and then count them. The requirement is met only if that count is less than or equal to 140.

Extracting all characters between two strings can be done with a regular expression in gsub. However, if the strings are repeated, you need to specify which occurrences delimit the window. Some examples:

txt <- "Lorem conubia amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor viverra"

This text contains "conubia" twice and "viverra" twice, so there are four ways to choose the window between conubia and viverra.

Option 1: between the last conubia and the first viverra

gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "

Option 2: between the first conubia and the last viverra

gsub(".*?conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
# [1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "

Option 3: between the first conubia and the first viverra

gsub(".*?conubia(.*?)viverra.*", "\\1", txt, perl = TRUE)
#[1] " amet conubia ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget "

Option 4: between the last conubia and the last viverra

gsub(".*conubia(.*)viverra.*", "\\1", txt, perl = TRUE)
#[1] " ipsum dolor sit amet, finibus torquent diam lobortis dolor ac eget viverra dolor "

To count the extracted characters, use nchar.

# Option 1
nchar(gsub(".*conubia(.*?)viverra.*", "\\1", txt, perl = TRUE))
#[1] 68
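
One caveat with this count: when the pattern does not match at all, gsub returns its input unchanged, so nchar reports the length of the whole string rather than a gap. The sketch below guards against that by returning NA for non-matching strings. gap_between is a hypothetical helper name, and it assumes the two terms contain no regex metacharacters and that the first term comes before the second.

```r
# Hypothetical helper: number of characters between `first` and `second`,
# or NA when the pattern does not match
gap_between <- function(x, first = "conubia", second = "viverra") {
  # Same pattern as Option 1: last `first`, then the nearest `second` after it
  patt <- paste0(".*", first, "(.*?)", second, ".*")
  matched <- grepl(patt, x, perl = TRUE)
  gap <- nchar(sub(patt, "\\1", x, perl = TRUE))
  gap[!matched] <- NA_integer_
  gap
}

gap_between(c("conubia et viverra", "no terms here"))
#[1]  4 NA
```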

Applying this approach:

set.seed(8)
s1 <- stri_rand_lipsum(10)
Nch <- nchar(gsub(".*conubia(.*?)viverra.*", "\\1", s1, perl = TRUE))
Nch
# [1] 637  42 512 528 595 640 522 407 388 512

We find that only the second element of s1 meets the requirement. To print that element, use s1[which(Nch <= 140)].
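
If only a yes/no answer per entry is needed, a possible alternative is to put the window directly into the pattern with a bounded quantifier and let grepl do the test. This is a sketch under assumptions: within_window is a hypothetical helper name, the terms are assumed to contain no regex metacharacters, and "window" is taken to mean at most width characters between the two terms, in either order.

```r
# Hypothetical helper: TRUE where `a` and `b` occur with at most `width`
# characters between them, in either order
within_window <- function(x, a, b, width = 140) {
  # a, then 0..width characters, then b; or the same with b first
  patt <- sprintf("%s.{0,%d}%s|%s.{0,%d}%s", a, width, b, b, width, a)
  grepl(patt, x, perl = TRUE)
}

# Made-up test strings: both orders, a gap of 200 spaces, and a single term
demo <- c("conubia et viverra",
          "viverra et conubia",
          paste0("conubia", strrep(" ", 200), "viverra"),
          "conubia only")
within_window(demo, "conubia", "viverra")
#[1]  TRUE  TRUE FALSE FALSE
```

Because the regex engine tries every starting position, this returns TRUE if any pair of occurrences is close enough, so there is no need to choose between the four options above.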
