How can I identify the location of the first word in all caps, ignoring words of fewer than 5 characters?

In R, I want to find the position (word index) of the first all-caps word in each string, ignoring words shorter than 5 characters.

Example data:

myvec <- c("FILT Words Here before CAPITALS words here after",
           "Words OUT ALLCAPS words MORECAPS words after",
           "Mywords PRE Before CAPLET more words after the capital Letters CAPLETTERS",
           "PRE CAP letters SPLIT here not before")

Desired results:

desired_first_word_over4char_caps <- c(5,3,4,4)
desired_first_word_over4char <- c("CAPITALS","ALLCAPS","CAPLET","SPLIT")

CodePudding user response：

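Using regular expressions: split each string on spaces, then look for tokens made up entirely of five or more capital letters (the native pipe |> needs R 4.1 or later):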

> strsplit(myvec, " ") |> lapply(grep, pattern = "^[A-Z]{5,}$") |> sapply(min)
[1] 5 3 4 4

> strsplit(myvec, " ") |> lapply(grep, pattern = "^[A-Z]{5,}$", value=TRUE) |> sapply(head, 1)
[1] "CAPITALS" "ALLCAPS"  "CAPLET"   "SPLIT"
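One caveat, offered as a sketch rather than a correction: if a string has no all-caps word of five or more letters, `grep()` returns `integer(0)`, and `min()` on that gives `Inf` with a warning. Indexing with `[1L]` gives `NA` instead. The helper name `first_caps_pos`, its `min_len` argument, and the test strings below are made up for illustration.

```r
# Hypothetical helper: position of the first all-caps word (>= min_len letters)
# in each string, or NA when a string has no such word.
first_caps_pos <- function(x, min_len = 5L) {
  pattern <- sprintf("^[A-Z]{%d,}$", min_len)
  words <- strsplit(x, " ", fixed = TRUE)
  # grep() returns integer(0) on no match; [1L] turns that into NA
  vapply(words, function(w) grep(pattern, w)[1L], integer(1L))
}

# Made-up input: the second string has no qualifying word
pos <- first_caps_pos(c("Words OUT ALLCAPS words", "no CAPS here"))
pos
```

Here the second element comes back as `NA` because "CAPS" has only four letters.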

CodePudding user response：

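# Split each string into space-separated words
words <- strsplit(myvec, ' ')
# Index of the first all-caps word with 5+ letters (NA if there is none)
desired_first_word_over4char_caps <- vapply(words, \(x) grep('^[A-Z]{5,}$', x)[1L], integer(1L))
# Pick the word at that index from each string
desired_first_word_over4char <- mapply(`[`, words, desired_first_word_over4char_caps)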

CodePudding user response：

Here is a one-liner that returns a named vector, where the names are the words and the values are their positions:

 mapply(match, stringr::str_extract(myvec, '\\b[A-Z]{5,}\\b'), strsplit(myvec, ' '))

CAPITALS  ALLCAPS   CAPLET    SPLIT 
       5        3        4        4

You can then convert the output to the desired format:

res <- mapply(match, stringr::str_extract(myvec, '\\b[A-Z]{5,}\\b'), strsplit(myvec, ' '))

names(res)
#[1] "CAPITALS" "ALLCAPS"  "CAPLET"   "SPLIT"   

unname(res)
#[1] 5 3 4 4

data.frame(position = res)
#         position
#CAPITALS        5
#ALLCAPS         3
#CAPLET          4
#SPLIT           4
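If stringr is not available, the same idea can be sketched in base R with `regexpr()` and `regmatches()`. This is an assumption-laden variant, not the answer's own code: `regmatches()` silently drops non-matching strings, so the sketch pre-fills a result vector with `NA` to keep everything aligned. The names `m`, `first_word`, and `pos`, and the third test string, are made up for illustration.

```r
myvec <- c("FILT Words Here before CAPITALS words here after",
           "Words OUT ALLCAPS words MORECAPS words after",
           "nothing to find here")  # made-up string with no match

# Locate the first run of 5+ capitals bounded by word boundaries
m <- regexpr("\\b[A-Z]{5,}\\b", myvec, perl = TRUE)

# regmatches() returns only the successful matches, so fill them in by index
first_word <- rep(NA_character_, length(myvec))
first_word[m > 0] <- regmatches(myvec, m)

# Position of each extracted word among its string's space-separated tokens
pos <- mapply(match, first_word, strsplit(myvec, " "), USE.NAMES = FALSE)
```

Note that `\\b` also matches next to punctuation, so a word extracted from a token such as "pre-CAPITALS" would not be found by `match()` and would yield `NA`.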

CodePudding user response：

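This uses strcapture to create a data frame with a column char, then adds a column char_caps holding the number of characters in each extracted word (not its position in the string). No packages are used; the _ placeholder requires R 4.2 or later.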

myvec |>
  strcapture("\\b([A-Z]{5,})\\b", x = _, data.frame(char = character(0))) |>
  transform(char_caps = nchar(char))
##       char char_caps
## 1 CAPITALS         8
## 2  ALLCAPS         7
## 3   CAPLET         6
## 4    SPLIT         5
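Since the question asked for the word's position rather than its length, a position column could be added to the `strcapture()` result along these lines. This is a minimal sketch under the assumption that words are separated by single spaces; the column name `position` is made up for illustration.

```r
myvec <- c("FILT Words Here before CAPITALS words here after",
           "Words OUT ALLCAPS words MORECAPS words after")

# Extract the first all-caps word of 5+ letters from each string
res <- strcapture("\\b([A-Z]{5,})\\b", myvec,
                  proto = data.frame(char = character(0)))

# Look up each extracted word among the space-separated tokens of its string
res$position <- unname(mapply(match, res$char, strsplit(myvec, " ")))
res
```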