How to create a regex expression to get a substring between 2 pipes

Time:05-03

I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:

ENST00000000233.10|ENSG00000004059.11|OTTHUMG000

I want to get the text between the first and second pipes, that being ENSG00000004059.11. I've tried several different regex expressions, but I can't really figure out the correct syntax. What should the correct regex expression be?

CodePudding user response:

Here is a regex.

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"

Created on 2022-05-03 by the reprex package (v2.0.1)

Explanation:

  • ^ beginning of string;
  • [^\\|]* not the pipe character zero or more times;
  • \\| the pipe character needs to be escaped since it's a meta-character;
  • ^[^\\|]*\\| the 3 above combined mean to match anything but the pipe character at the beginning of the string zero or more times until a pipe character is found;
  • ([^\\|]+) a capture group that matches anything but the pipe character at least once;
  • \\|.*$ the pipe plus anything until the end of the string.

Then keep the 1st (and only) group with "\\1".
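One thing to keep in mind is that sub() returns its input unchanged when the pattern does not match, so a string with fewer than two pipes would come back whole. A minimal sketch of a guard around the same pattern follows; the helper name between_pipes and the sample vector ids are made up for illustration, and [^|] is used as an equivalent spelling of [^\\|].

```r
# Hypothetical helper (not from the answer above): returns the field between
# the first and second pipe, or NA when the string has fewer than two pipes.
between_pipes <- function(x) {
  pattern <- "^[^|]*\\|([^|]+)\\|.*$"
  # sub() leaves non-matching strings untouched, so check with grepl() first
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

ids <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
         "no pipes here",
         "a|b|c|d")
between_pipes(ids)
#> [1] "ENSG00000004059.11" NA                   "b"
```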

CodePudding user response:

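Try this: \|.*\| or, in R, \\|.*\\| since the backslash itself has to be escaped inside an R string. The pattern is an escaped pipe, followed by any character (.) repeated any number of times (*), followed by another escaped pipe. Note that .* is greedy, so if the string contains more than two pipes the match runs up to the last one.

The match still includes both pipes. Extract it first (for example with str_extract()), then wrap the result in str_sub(MyString, 2, -2) to get rid of the pipes if you don't want them.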
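As a rough sketch of the extract-then-trim idea, here is a base R version that swaps regmatches()/regexpr() and substr() in for the stringr functions named above, so it runs without extra packages. The second string y is a made-up example showing how the greedy .* behaves when there are more than two pipes, and how a negated class [^|]* keeps the match to the first pair.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

# Extract the first match, pipes included
m <- regmatches(x, regexpr("\\|.*\\|", x))   # "|ENSG00000004059.11|"

# Drop the first and last character (the two pipes)
trimmed <- substr(m, 2, nchar(m) - 1)
trimmed
#> [1] "ENSG00000004059.11"

# Made-up string with three pipes: greedy .* runs to the last pipe
y <- "a|b|c|d"
greedy <- regmatches(y, regexpr("\\|.*\\|", y))     # "|b|c|"
narrow <- regmatches(y, regexpr("\\|[^|]*\\|", y))  # "|b|"
```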

CodePudding user response:

Another option is to get the second item after splitting the string on |.

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]

# [1] "ENSG00000004059.11"

Or with tidyverse:

library(tidyverse)

str_split(x, "\\|") %>% map_chr(`[`, 2)

# [1] "ENSG00000004059.11"
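For a whole column of IDs, the same split can be done in base R without the tidyverse. This is only a sketch: the vector ids is made up, and fixed = TRUE is used so the pipe is treated literally and needs no escaping.

```r
ids <- c("ENST00000000233.10|ENSG00000004059.11|OTTHUMG000", "a|b|c")

# Split every string on a literal pipe; the result is a list of character vectors
parts <- strsplit(ids, "|", fixed = TRUE)

# Take the second piece of each; p[2] is NA when a string has fewer than two pieces
second <- vapply(parts, function(p) p[2], character(1))
second
# [1] "ENSG00000004059.11" "b"
```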

CodePudding user response:

Maybe use lookahead and lookbehind to extract the string that is surrounded by two "|".

The regex literally means: match one or more characters, as few as possible (.+?), that come right after a "|" ((?<=\\|)) and right before a "|" ((?=\\|)).

library(stringr)

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")

[1] "ENSG00000004059.11"
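
If stringr is not available, a similar lookaround match should work in base R, assuming perl = TRUE is passed so the PCRE engine handles the lookbehind. In this sketch [^|]+ replaces the lazy quantifier, since it stops at the next pipe by itself.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

# perl = TRUE switches to PCRE, which supports lookbehind and lookahead
m <- regexpr("(?<=\\|)[^|]+(?=\\|)", x, perl = TRUE)

# regmatches() pulls out the matched substring, pipes excluded
middle <- regmatches(x, m)
middle
#> [1] "ENSG00000004059.11"
```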