Unable to extract postfixes using regular expression in R

I am working on single-cell RNA data and trying to demultiplex a raw count matrix. I am following a tutorial in which the barcode is in this format: "BIOKEY_33_Pre_AAACCTGAGAGACTTA-1", where BIOKEY_33_Pre is the prefix and AAACCTGAGAGACTTA-1 is the sequence of bases. The prefixes are sample names, so they will be used to demultiplex the data.

Using this regular expression, I can extract the prefixes: data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)

The problem is, in my data, the barcode is in this format: AAACCTGAGAAACCGC_LN_05 where the sequence is first, and the sample name is last. I need to extract postfixes. If I run the above regular expression on my data, I get the following output:

data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
sample.names <- unique(data.pfx)
head(sample.names)
"AAACCTGAGAAACCGC_LN_05" 
"AAACCTGAGAAACGCC_NS_13"
"AAACCTGAGCAATATG_LUNG_N34"

The desired output:
"LN_05"
"NS_13"
"LUNG_N34"

CodePudding user response：

You can use

sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", sample.names)

Details:

.* - any zero or more chars, as many as possible
_ - an underscore
([A-Z]+_[0-9A-Z]+) - Group 1 (\1): one or more uppercase ASCII letters, _ and one or more uppercase ASCII letters or digits
$ - end of string.
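
As a quick check, the pattern can be run against the three barcodes from the question. The vector name `barcodes` is made up for this sketch; in the real data the input would presumably be `colnames(data.count)`. The original prefix pattern leaves these strings untouched because it requires a trailing "-1", which these barcodes do not have, and `gsub()` returns its input unchanged when nothing matches.

```r
# Hypothetical input: the three barcodes shown in the question.
barcodes <- c("AAACCTGAGAAACCGC_LN_05",
              "AAACCTGAGAAACGCC_NS_13",
              "AAACCTGAGCAATATG_LUNG_N34")

# The prefix pattern needs "-1" at the end, so nothing matches here
# and the strings come back unchanged.
unchanged <- gsub("(.+)_[A-Z]+-1$", "\\1", barcodes, perl = TRUE)

# Greedy .* consumes the sequence; group 1 keeps
# "<letters>_<letters or digits>" at the end of the string.
postfixes <- sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", barcodes)
print(postfixes)
#> [1] "LN_05"    "NS_13"    "LUNG_N34"
```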

CodePudding user response：

It is a bit easier to just remove all leading capital letters up to and including the first underscore:

sample.names <- c("AAACCTGAGAAACCGC_LN_05" ,
                  "AAACCTGAGAAACGCC_NS_13")
sub("^[A-Z]+_", "", sample.names)
#> [1] "LN_05" "NS_13"
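
To connect this back to demultiplexing, the sketch below applies the same pattern to every barcode (without calling `unique()` first) so that each cell keeps its sample label, then groups the barcodes by sample. The `barcodes` vector is an invented stand-in for `colnames(data.count)`, its fourth entry is made up to give one sample two cells, and grouping with `split()` is only one possible way to proceed, not something the tutorial prescribes.

```r
# Hypothetical stand-in for colnames(data.count); the last barcode is
# invented so that sample LN_05 has more than one cell.
barcodes <- c("AAACCTGAGAAACCGC_LN_05",
              "AAACCTGAGAAACGCC_NS_13",
              "AAACCTGAGCAATATG_LUNG_N34",
              "AAACCTGAGTGGTAAT_LN_05")

# Strip the leading run of capital letters plus the first underscore,
# leaving one sample label per cell.
sample.of.cell <- sub("^[A-Z]+_", "", barcodes)

# Named list: one element per sample, holding that sample's barcodes.
cells.by.sample <- split(barcodes, sample.of.cell)
print(lengths(cells.by.sample))
```

Note that this shortcut assumes the sequence part contains only capital letters and no underscore, which holds for the barcodes shown in the question.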