I am working with single-cell RNA data and trying to demultiplex a raw count matrix. I am following this. In this tutorial, the barcode is in this format:
"BIOKEY_33_Pre_AAACCTGAGAGACTTA-1"
where BIOKEY_33_Pre is the prefix and AAACCTGAGAGACTTA-1 is the sequence of bases. The prefixes are sample names, so they will be used to demultiplex the data.
Using this regular expression, I can extract the prefixes:
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
The problem is that in my data the barcode is in this format:
AAACCTGAGAAACCGC_LN_05
where the sequence comes first and the sample name comes last, so I need to extract the suffixes instead. If I run the above regular expression on my data, I get the following output:
data.pfx <- gsub("(.+)_[A-Z]+-1$", "\\1", colnames(data.count), perl=TRUE)
sample.names <- unique(data.pfx)
head(sample.names)
"AAACCTGAGAAACCGC_LN_05"
"AAACCTGAGAAACGCC_NS_13"
"AAACCTGAGCAATATG_LUNG_N34"
The desired output:
"LN_05"
"NS_13"
"LUNG_N34"
CodePudding user response:
You can use
sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", sample.names)
See the regex demo.
Details:
.* - any zero or more chars, as many as possible
_ - an underscore
([A-Z]+_[0-9A-Z]+) - Group 1 (\1): one or more uppercase ASCII letters, an underscore, and one or more uppercase ASCII letters or digits
$ - end of string.
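As a quick check, the pattern can be run against the three example barcodes from the question. This is only a minimal sketch: the vector name `barcodes` is made up for illustration, and the pattern assumes every sample name has the shape letters, underscore, letters or digits.

```r
# Example barcodes copied from the question; `barcodes` is a hypothetical name.
barcodes <- c("AAACCTGAGAAACCGC_LN_05",
              "AAACCTGAGAAACGCC_NS_13",
              "AAACCTGAGCAATATG_LUNG_N34")

# Replace the whole string with Group 1, i.e. the trailing sample name.
suffixes <- sub(".*_([A-Z]+_[0-9A-Z]+)$", "\\1", barcodes)
print(suffixes)
#> [1] "LN_05"    "NS_13"    "LUNG_N34"
```

A barcode that does not match the pattern is returned unchanged by sub(), which is the same behaviour the question shows for the tutorial's regular expression.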
CodePudding user response:
A slightly simpler approach is to remove all leading capital letters up to and including the first underscore:
sample.names <- c("AAACCTGAGAAACCGC_LN_05",
                  "AAACCTGAGAAACGCC_NS_13")
sub("^[A-Z]+_", "", sample.names)
#> [1] "LN_05" "NS_13"
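To tie this back to the demultiplexing goal in the question, here is a rough sketch of how the extracted suffixes might be used to split a count matrix by sample. The toy matrix, its gene names, the fourth barcode, and the names `data.sfx` and `per.sample` are all invented for illustration; only `data.count` and `sample.names` follow the question's naming. It assumes the sequence part of each barcode contains only uppercase letters.

```r
# Toy count matrix: 2 genes x 4 cells. Values, gene names and the
# fourth barcode are made up; the first three barcodes are from the question.
data.count <- matrix(1:8, nrow = 2,
                     dimnames = list(c("geneA", "geneB"),
                                     c("AAACCTGAGAAACCGC_LN_05",
                                       "AAACCTGAGAAACGCC_NS_13",
                                       "AAACCTGAGCAATATG_LUNG_N34",
                                       "AAACCTGAGCACCGTC_LN_05")))

# Strip the leading run of capital letters and the first underscore.
data.sfx <- sub("^[A-Z]+_", "", colnames(data.count))
sample.names <- unique(data.sfx)

# Group column indices by sample name, then subset the matrix per sample.
per.sample <- lapply(split(seq_len(ncol(data.count)), data.sfx),
                     function(idx) data.count[, idx, drop = FALSE])

print(sample.names)
print(sapply(per.sample, ncol))
```

Here `per.sample` is a named list with one sub-matrix per sample name, and `drop = FALSE` keeps single-cell samples as matrices rather than plain vectors.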