Extract the capitalized letter sequence in a column and replace the column with the newly truncated

I have the following character vector, which contains parentheses, periods, and unnecessary descriptive words:

strings <- c("Poorly Graded Silty Sand (SP-SM).", "(Visual) Lean Clay (CL), with some sand.","Poorly Graded Silty Sand (SP-SM).","(Visual) Inorganic Silt (ML).","(Visual) Lean Clay (CL), with some sand.")

I want to extract only the letter code that sits inside the parentheses on each line (e.g. ML or SP-SM). Here is the desired vector:

need <- c("SP-SM", "CL","SP-SM","ML","CL")

Is this possible?

CodePudding user response：

We can use str_extract with regex lookarounds: match one or more uppercase letters or hyphens that are preceded by an opening parenthesis and followed by a closing parenthesis. Because the lookarounds are zero-width, the parentheses themselves are not part of the result.

library(stringr)
str_extract(strings, "(?<=\\()[A-Z-]+(?=\\))")
[1] "SP-SM" "CL"    "SP-SM" "ML"    "CL"
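
Since the title asks about replacing a column, here is a minimal sketch of that step in base R, assuming the strings live in a data frame. The names `df`, `description`, and `codes` are made up for illustration and are not from the question. It uses a lookaround pattern for uppercase letters and hyphens inside parentheses, with `regexpr(..., perl = TRUE)` and `regmatches()` standing in for `str_extract`.

```r
# Hypothetical data frame; `df` and the column name `description` are assumptions.
df <- data.frame(
  description = c("Poorly Graded Silty Sand (SP-SM).",
                  "(Visual) Lean Clay (CL), with some sand.",
                  "(Visual) Inorganic Silt (ML)."),
  stringsAsFactors = FALSE
)

# Uppercase letters/hyphens enclosed in parentheses.
# perl = TRUE is required for lookbehind in base R.
pattern <- "(?<=\\()[A-Z-]+(?=\\))"
m <- regexpr(pattern, df$description, perl = TRUE)

# Start from NA so rows without a code stay NA instead of shifting the results.
codes <- rep(NA_character_, nrow(df))
codes[m != -1] <- regmatches(df$description, m)

# Overwrite the column with the extracted codes.
df$description <- codes
print(df$description)
```

Filling a pre-sized NA vector matters because `regmatches()` silently drops non-matching elements, which would otherwise misalign the codes with their rows.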

CodePudding user response：

This is a longer, more explicit alternative to akrun's solution, using word boundaries instead of lookarounds:

str_extract(strings, '\\b[A-Z]{2}\\b\\-\\b[A-Z]{2}\\b|\\b[A-Z]{2}\\b')

output:

[1] "SP-SM" "CL"    "SP-SM" "ML"    "CL"

Explanation:

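\\b Matches a word boundary: the position between a word character and a non-word character (or the start or end of the string).

[A-Z]{2} Matches exactly two capital letters.

\\- Matches a literal hyphen.

| Means OR: the pattern tries the hyphenated form (e.g. SP-SM) first and falls back to a single two-letter code (e.g. CL).

Note that this pattern assumes two-letter codes and does not require the surrounding parentheses.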
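
The two approaches can disagree on other input. The sketch below is only an illustration: the second input string is invented, and `first_match` is a hypothetical helper, not part of any package. It writes the alternation more compactly with an optional group and compares it against a pattern that requires the parentheses.

```r
# Hypothetical inputs; the second string is made up to show a false positive.
x <- c("Poorly Graded Silty Sand (SP-SM).",
       "US Standard sieve, Silty Sand (SM).")

# Hypothetical helper: first match per string, NA where nothing matches.
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# Word-boundary pattern, with the hyphenated second half made optional.
by_boundary <- first_match("\\b[A-Z]{2}(?:-[A-Z]{2})?\\b", x)

# Lookaround pattern that only accepts codes enclosed in parentheses.
by_parens <- first_match("(?<=\\()[A-Z-]+(?=\\))", x)

print(by_boundary)  # picks up "US" in the second string
print(by_parens)    # picks up "SM" in the second string
```

If the descriptions can contain other two-letter capitalized words, the parenthesis-anchored pattern is probably the safer choice.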