Extract multiple substring patterns from a vector

Time:07-13

Let's say I have a vector as follows:

patient_condition <- c("Pre_P1","Post_P1","Enriched_Post_P1","Post_P1_2","Pre_P2","Post_P2", "P3_Pre")
to_match <- c("P1","P2","P3")

I want to create another vector that holds, for each element of patient_condition, the value from to_match that occurs in it as a substring. The expected result is:

[1] "P1"  "P1"  "P1"  "P1"  "P2"  "P2"  "P3"

Any help is appreciated. Thank you!

CodePudding user response:

We can use

stringr::str_extract(patient_condition, "P[0-9]+")
#[1] "P1" "P1" "P1" "P1" "P2" "P2" "P3"

Misc Replies

In my case, this answer works. But my actual question is about extracting substrings from a vector given an arbitrary set of values to match. This answer won't work if I want to extract words instead (e.g. Pre, Post, Enriched):

to_match <- c("Pre", "Post", "Enriched")

In that case, we can use

## R-level loop through `to_match`
tmp <- t(sapply(to_match, stringr::str_extract, string = patient_condition))
tmp[!is.na(tmp)]
#[1] "Pre"      "Post"     "Post"     "Enriched" "Post"     "Pre"      "Post"     "Pre"
## "Enriched_Post_P1" contains both "Post" and "Enriched", so it contributes two values

or

## convert multiple matches to REGEX "or" operation `|`
stringr::str_extract(patient_condition, paste0(to_match, collapse = "|"))
#[1] "Pre"      "Post"     "Enriched" "Post"     "Pre"      "Post"     "Pre"

ThomasIsCoding's answer using gregexpr + regmatches is also a good alternative.

Note that this does exact substring matching.
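One caveat worth spelling out: when the entries of to_match are pasted together with `|`, they are interpreted as regular expressions, so characters such as `.` or `+` are not matched literally. Below is a minimal base R sketch of a literal alternative using grepl(fixed = TRUE). The vectors x and to_match here are made-up illustration data, not from the question, and the loop returns the first entry of to_match found in each string (in to_match order), which can differ from the leftmost match in the string.

```r
# Hypothetical data: patterns that contain regex metacharacters
x        <- c("a.b_1", "axb_2", "c+d_3")
to_match <- c("a.b", "c+d")

# Start with NA everywhere, then fill in the first literal hit per element
result <- rep(NA_character_, length(x))
for (p in to_match) {
  # fixed = TRUE treats p as a plain string, not a regular expression
  hit <- is.na(result) & grepl(p, x, fixed = TRUE)
  result[hit] <- p
}
result
#[1] "a.b" NA    "c+d"
```

With a regex, "a.b" would also have matched "axb_2", because `.` stands for any character.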

CodePudding user response:

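You could grep, then rep each pattern by its number of matches. Note that the result is ordered by to_match rather than by patient_condition; it lines up here only because the input is already grouped by patient.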

Map(rep, to_match, lengths(sapply(to_match, grep, patient_condition)), USE.NAMES=FALSE) |> unlist()
# [1] "P1" "P1" "P1" "P1" "P2" "P2" "P3"

CodePudding user response:

A base R option using regmatches to extract the desired patterns. Note that the result is a list with one element per input string:

> regmatches(patient_condition, gregexpr(paste0(to_match, collapse = "|"), patient_condition))
[[1]]
[1] "P1"

[[2]]
[1] "P1"

[[3]]
[1] "P1"

[[4]]
[1] "P1"

[[5]]
[1] "P2"

[[6]]
[1] "P2"

[[7]]
[1] "P3"
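If a plain character vector is wanted instead of a list, the matches can be flattened. Here is a minimal sketch, assuming you only want the first match per element and NA where nothing matched; the names hits and first_match are just illustrative.

```r
patient_condition <- c("Pre_P1", "Post_P1", "Enriched_Post_P1", "Post_P1_2",
                       "Pre_P2", "Post_P2", "P3_Pre")
to_match <- c("P1", "P2", "P3")

# Build the alternation pattern and collect all matches per element
pattern <- paste0(to_match, collapse = "|")
hits <- regmatches(patient_condition, gregexpr(pattern, patient_condition))

# Keep the first match of each element; NA if an element had no match
first_match <- vapply(
  hits,
  function(h) if (length(h) > 0) h[1] else NA_character_,
  character(1)
)
first_match
#[1] "P1" "P1" "P1" "P1" "P2" "P2" "P3"
```

unlist(hits) would also work for this input, but it silently drops elements with no match and expands elements with several matches, so the result may no longer line up with patient_condition.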

CodePudding user response:

More generally, if you already have a lookup table of full condition names and their labels, you can use match:

lookup <- c("Pre_P1","Post_P1","Enriched_Post_P1","Post_P1_2","Pre_P2","Post_P2", "P3_Pre")
to_match <- c("P1","P1","P1", "P1", "P2", "P2","P3")
patient_condition <- c("P3_Pre", "Post_P1", "Enriched_Post_P1")

result <- to_match[match(patient_condition, lookup)]
result
#[1] "P3" "P1" "P1"