I have the following character vector, which includes parentheses, periods, and unnecessary descriptive words:
strings <- c("Poorly Graded Silty Sand (SP-SM).", "(Visual) Lean Clay (CL), with some sand.","Poorly Graded Silty Sand (SP-SM).","(Visual) Inorganic Silt (ML).","(Visual) Lean Clay (CL), with some sand.")
I want to extract only the letter code that sits inside the parentheses on each line (e.g. ML or SP-SM). Here is the desired vector:
need <- c("SP-SM", "CL","SP-SM","ML","CL")
Is this possible?
CodePudding user response:
We may use str_extract with regex lookarounds to match one or more uppercase letters or hyphens (-) that are preceded by an opening parenthesis and followed by a closing parenthesis:
library(stringr)
str_extract(strings, "(?<=\\()[A-Z-]+(?=\\))")
[1] "SP-SM" "CL" "SP-SM" "ML" "CL"
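For readers who prefer not to load stringr, the same lookaround idea can be sketched in base R. This is only an illustration under the assumption that every element contains a code in parentheses: regmatches() silently drops elements with no match, whereas str_extract returns NA for them. The variable name codes is introduced here for illustration.

```r
strings <- c("Poorly Graded Silty Sand (SP-SM).",
             "(Visual) Lean Clay (CL), with some sand.",
             "Poorly Graded Silty Sand (SP-SM).",
             "(Visual) Inorganic Silt (ML).",
             "(Visual) Lean Clay (CL), with some sand.")
need <- c("SP-SM", "CL", "SP-SM", "ML", "CL")

# Lookbehind for "(", one or more capitals or hyphens, lookahead for ")".
# perl = TRUE is required because lookarounds are a PCRE feature.
pattern <- "(?<=\\()[A-Z-]+(?=\\))"

# regexpr() finds the first match per element; regmatches() extracts it.
codes <- regmatches(strings, regexpr(pattern, strings, perl = TRUE))

identical(codes, need)
```

"(Visual)" is skipped because the lowercase letters after "V" prevent the match from reaching the closing parenthesis.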
CodePudding user response:
This is a longer, more explicit alternative to akrun's solution, using word boundaries instead of lookarounds:
str_extract(strings, '\\b[A-Z]{2}\\b\\-\\b[A-Z]{2}\\b|\\b[A-Z]{2}\\b')
output:
[1] "SP-SM" "CL" "SP-SM" "ML" "CL"
Explanation:
\\b matches a word boundary: the position between a word character and a non-word character.
[A-Z]{2} matches exactly two capital letters.
\\- matches a literal hyphen.
| defines OR: the pattern tries the hyphenated form first, then a single two-letter code.
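One difference worth noting: the word-boundary pattern does not look at the parentheses at all, so it returns the first standalone two-capital-letter word it finds. The sketch below compares it with a lookaround pattern that does require the surrounding parentheses, on a made-up string. The input x and the names lookaround, boundary, by_lookaround, and by_boundary are hypothetical and not part of the original data.

```r
library(stringr)

# Hypothetical input: a two-letter capitalized word precedes the code.
x <- "OK to excavate. Lean Clay (CL)."

# Requires "(" before and ")" after the code.
lookaround <- "(?<=\\()[A-Z-]+(?=\\))"
# Word-boundary pattern from the answer above; ignores parentheses.
boundary <- "\\b[A-Z]{2}\\b\\-\\b[A-Z]{2}\\b|\\b[A-Z]{2}\\b"

by_lookaround <- str_extract(x, lookaround)  # "CL"
by_boundary   <- str_extract(x, boundary)    # "OK"
```

On the vector in the question both patterns give the same result, so the choice only matters if the descriptions can contain other two-letter capitalized words.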