Pattern matching character strings when everything but the last two elements of the character are th

Time:12-30

I have the following vector.

column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
                  "205Pb", "206Pb", "207Pb", "238U",
                  "206Pb.1", "238U.1")

Notice that some of the values are duplicates of another value with ".1" appended. I want to extract all of these strings together with the originals they duplicate, so that only the following are returned.

#[1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 

Assume you don't know the index positions, so you cannot simply extract these values with column_names[c(2,5,7,9,10,11)]. How can I use pattern matching to extract them?

CodePudding user response:

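There is likely a more elegant solution, but in base R you could try a combination of grep/sub and paste:

# Stems of the names that end in ".1"
stems <- sub("\\.1$", "", grep("\\.1$", column_names, value = TRUE))
# Indices of every name that contains one of those stems
idx <- grep(paste(stems, collapse = "|"), column_names)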
# [1]  2  5  7  9 10 11

column_names[idx]
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 
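
One caveat with the alternation approach: a pattern such as "7Li|206Pb|238U" matches anywhere in a string, so a hypothetical name like "17Li" would also be returned. Below is a minimal sketch that compares whole stems with %in% instead, assuming the suffix is always a literal ".1" at the end of the name. The variable names has_suffix, stems, dup_stems and result are illustrative only.

```r
column_names <- c("6Li", "7Li", "10B", "11B", "7Li.1",
                  "205Pb", "206Pb", "207Pb", "238U",
                  "206Pb.1", "238U.1")

# TRUE for names that end in a literal ".1"
has_suffix <- grepl("\\.1$", column_names)

# Strip the trailing ".1" so a duplicate and its original share one stem
stems <- sub("\\.1$", "", column_names)

# Stems that appear with the ".1" suffix at least once
dup_stems <- unique(stems[has_suffix])

# Keep names whose whole stem is in that set (exact match, not substring)
result <- column_names[stems %in% dup_stems]
result
```

Unlike the duplicated() approach, this would also return a ".1" name whose original is missing from the vector, which may or may not be what you want.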

CodePudding user response:

Using gsub() and duplicated() to find values with repeated stems:

column_stems <- gsub("\\.1", "", column_names)

dup_idx <- duplicated(column_stems) | duplicated(column_stems, fromLast = TRUE)

column_names[dup_idx]
# [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 

To also find instances ending with .2, .3, etc., use "\\.\\d+" instead of "\\.1" in gsub().
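
As a rough sketch of that generalization, the example below uses a made-up vector (not the one from the question) that contains ".2" and ".10" suffixes. The pattern is anchored with $ so that only a trailing numeric suffix is stripped, and sub() is used because at most one replacement per string is needed.

```r
# Hypothetical input with numeric suffixes other than ".1"
column_names <- c("6Li", "7Li", "7Li.1", "7Li.2", "10B", "238U", "238U.10")

# Remove a trailing "." followed by one or more digits
column_stems <- sub("\\.\\d+$", "", column_names)

# Flag every element whose stem occurs more than once, in either direction
dup_idx <- duplicated(column_stems) | duplicated(column_stems, fromLast = TRUE)

result <- column_names[dup_idx]
result
# [1] "7Li"     "7Li.1"   "7Li.2"   "238U"    "238U.10"
```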

CodePudding user response:

You could use stringr:

library(stringr)

idx <- str_extract(column_names, ".*(?=\\.1)")

column_names[str_detect(column_names, paste(idx[!is.na(idx)], collapse = "|"))]

This returns

#> [1] "7Li"     "7Li.1"   "206Pb"   "238U"    "206Pb.1" "238U.1" 