I need to extract the leading part of a string: everything that is uppercase, up to the first lowercase letter.
For example, I have the text: "IV LONG TEXT HERE and now the Text End HERE"
I want to extract the "IV LONG TEXT HERE".
I have been trying something like this:
text <- "IV LONG TEXT HERE and now the Text End HERE"
stringr::str_extract_all(text, "[A-Z]")
but I'm failing at the regex.
CodePudding user response:
You could use str_extract, with a pattern to match a single uppercase char and optionally match spaces and uppercase chars ending with another uppercase char.
\b[A-Z](?:[A-Z ]*[A-Z])?\b
Explanation
\b[A-Z] : A word boundary to prevent a partial word match, then match a single char A-Z
(?: : Non-capture group to match as a whole
[A-Z ]*[A-Z] : Match optional chars A-Z or a space, then match a char A-Z
)? : Close the non-capture group and make it optional
\b : A word boundary
Example
text <- "IV LONG TEXT HERE and now the Text End HERE"
stringr::str_extract(text, "\\b[A-Z](?:[A-Z ]*[A-Z])?\\b")
Output
[1] "IV LONG TEXT HERE"
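If stringr is not available, the same pattern can be tried in base R. This is only a sketch of an alternative, not part of the answer above: it assumes regexpr() with perl = TRUE, so that the non-capture group and \b are handled by PCRE, and the variable name first_upper is made up for illustration.

```r
text <- "IV LONG TEXT HERE and now the Text End HERE"
pattern <- "\\b[A-Z](?:[A-Z ]*[A-Z])?\\b"

# regexpr() locates the first match only; regmatches() pulls out the matched text.
# perl = TRUE switches to PCRE, which supports (?: ... ) and \b.
first_upper <- regmatches(text, regexpr(pattern, text, perl = TRUE))

print(first_upper)
# [1] "IV LONG TEXT HERE"
```

Note that regmatches() returns a zero-length character vector when nothing matches, whereas str_extract returns NA.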
CodePudding user response:
Instead of str_extract, use str_replace or str_remove.
library(stringr)
# match one or more spaces (\\s+) followed by
# one or more lowercase letters ([a-z]+) and the rest of the characters (.*),
# then remove everything that matched
str_remove(text, "\\s+[a-z]+.*")
[1] "IV LONG TEXT HERE"
# or match one or more uppercase letters or spaces ([A-Z ]+),
# captured as a group `()`, followed by one or more spaces (\\s+) and the rest
# of the characters (.*); replace the match with the backreference (\\1)
str_replace(text, "([A-Z ]+)\\s+.*", "\\1")
[1] "IV LONG TEXT HERE"
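The removal approach can also be sketched in base R with sub(), applied to several strings at once to show how it behaves on inputs other than the one in the question. The extra test strings and the variable names are invented for illustration; perl = TRUE is assumed so that \s is handled by PCRE.

```r
texts <- c(
  "IV LONG TEXT HERE and now the Text End HERE",
  "ALL CAPS ONLY",            # no lowercase word: nothing is removed
  "IV LONG Text and more"     # capitalised word before the first lowercase word
)

# Drop everything from the first run of whitespace that is directly
# followed by a lowercase letter.
trimmed <- sub("\\s+[a-z]+.*", "", texts, perl = TRUE)

print(trimmed)
# [1] "IV LONG TEXT HERE" "ALL CAPS ONLY"     "IV LONG Text"
```

The third result shows a limitation: a capitalised word such as "Text" survives, because the pattern only looks for whitespace followed directly by a lowercase letter.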
CodePudding user response:
The below code sample should work.
text <- "IV LONG TEXT HERE and now the Text End HERE"
stringr::str_extract_all(text, "\\w.*[A-Z] \\b")
Output :
[1] 'IV LONG TEXT HERE '
Interpretation :
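Match a word character (\w), then any characters (.*), ending with an uppercase letter ([A-Z]) that is followed by a space and a word boundary ( \b). Because .* is greedy, the match runs to the last uppercase letter followed by a space, which in this string is the end of the first HERE. Note that the result keeps the trailing space.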
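To see how much this pattern depends on the input, it can be compared with the pattern from the first answer on a different string. This is a sketch in base R: the helper extract_first and the string text2 are hypothetical names introduced here, and perl = TRUE is assumed.

```r
# Hypothetical helper: return the first match of `pattern` in `x`.
extract_first <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# An input with a later uppercase letter followed by a space.
text2 <- "IV LONG TEXT and then A second part"

greedy   <- extract_first("\\w.*[A-Z] \\b", text2)
anchored <- extract_first("\\b[A-Z](?:[A-Z ]*[A-Z])?\\b", text2)

print(greedy)
# [1] "IV LONG TEXT and then A "
print(anchored)
# [1] "IV LONG TEXT"
```

On this input the greedy pattern runs past the lowercase words up to the later "A ", while the pattern from the first answer stops at the end of the leading uppercase run.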