How to subset a character vector based on substring matches?

Time:10-23

I want to create a variable, ori.same.maf.barcodes, that stores the elements of ori.maf.barcode whose substring before the fourth "-" character matches one of the strings in sub.same.barcodes.

sub.same.barcodes "TCGA-BQ-7058-01A" "TCGA-DZ-6131-02A" "TCGA-UZ-A9PZ-03A" "TCGA-2Z-A9JQ-01A" "TCGA-BQ-5887-11A" "TCGA-2Z-A9JQ-01A"

ori.maf.barcode example:

"TCGA-BQ-7058-01A-11D-1963-05" "TCGA-DZ-6131-01A-11D-1963-05"
"TCGA-UZ-A9PZ-01A-11D-A42K-05" "TCGA-2Z-A9JQ-01A-11D-A42K-05"
"TCGA-BQ-5887-11A-01D-1963-05" "TCGA-G7-7502-01A-12D-A43K-06"

Expected output:

ori.same.maf.barcodes

"TCGA-BQ-7058-01A-11D-1963-05" 
"TCGA-2Z-A9JQ-01A-11D-A42K-05"
"TCGA-BQ-5887-11A-01D-1963-05" 
"TCGA-G7-7502-01A-12D-A43K-06"

Attempt:

ori.same.maf.barcodes <- ori.maf.barcode %in% sub.same.barcodes

But my code returns a logical vector of FALSE values instead of a character vector.

CodePudding user response:

Please note that, with the sample data you provided, the value TCGA-G7-7502-01A-12D-A43K-06 cannot appear in the output, because its prefix TCGA-G7-7502-01A is not in sub.same.barcodes.

library(stringr)

sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A", 
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")

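# str_extract() returns one match per element as a character vector;
# str_extract_all() would return a list instead
prefix <- str_extract(ori.maf.barcode, '^.{4}-.{2}-.{4}-.{3}')
idx <- which(prefix %in% sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[ idx ]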
print(ori.same.maf.barcodes)
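
For comparison, here is a minimal base R sketch of the same idea without stringr. It splits each barcode on "-" and rejoins the first four fields rather than using a regular expression. It assumes every barcode has at least four hyphen-separated fields; the helper name barcode_prefix is made up for illustration and is not part of the answer above.

```r
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")

# Hypothetical helper: join the first n hyphen-separated fields of each barcode
barcode_prefix <- function(x, n = 4) {
  fields <- strsplit(x, "-", fixed = TRUE)
  vapply(fields, function(p) paste(head(p, n), collapse = "-"), character(1))
}

# TRUE where the four-field prefix is one of the wanted prefixes
keep <- barcode_prefix(ori.maf.barcode) %in% sub.same.barcodes
ori.same.maf.barcodes <- ori.maf.barcode[keep]
print(ori.same.maf.barcodes)
```

With the sample data this selects the three barcodes whose prefixes are TCGA-BQ-7058-01A, TCGA-2Z-A9JQ-01A and TCGA-BQ-5887-11A.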

CodePudding user response:

We could use sub() to extract the substring up to the fourth "-" and then use %in% to build a logical vector for subsetting.

# Keep the first four hyphen-separated fields (without a trailing "-") and compare them
i1 <- sub("^(([^-]+-){3}[^-]+).*", "\\1", ori.maf.barcode) %in%
       sub("^(([^-]+-){3}[^-]+).*", "\\1", sub.same.barcodes)
ori.same.maf.barcodes <- ori.maf.barcode[i1]

CodePudding user response:

You're almost there, but your code ori.maf.barcode %in% sub.same.barcodes returns a logical vector of TRUE and FALSE values, which is what you are seeing. To get back the values that are TRUE, you need to use that expression to subset the vector.

ori.maf.barcode[which(ori.maf.barcode %in% sub.same.barcodes)]

If it is a vector, this returns another vector containing only the entries for which the logical test is TRUE.
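
As a small sketch of that subsetting step, using two barcodes from the question: indexing with the logical vector directly gives the same result as wrapping it in which(). The variable names is.exact, via.logical and via.which are made up for illustration. Note that an exact comparison of full barcodes against the shorter prefixes selects nothing here.

```r
ori.maf.barcode   <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05")
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-2Z-A9JQ-01A")

# %in% tests for exact equality, so no full barcode equals a shorter prefix
is.exact <- ori.maf.barcode %in% sub.same.barcodes

# Both forms of subsetting keep the elements where is.exact is TRUE
via.logical <- ori.maf.barcode[is.exact]
via.which   <- ori.maf.barcode[which(is.exact)]

print(is.exact)
print(identical(via.logical, via.which))
```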

You also need to match on the first part of each string to get the entries you want, as iod said.

This loop picks them out one prefix at a time and adds them to a new vector:

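new.barcodes <- character(0)
# unique() avoids duplicated results when a prefix is listed more than once;
# the loop variable is named prefix so it does not mask base::sub
for (prefix in unique(sub.same.barcodes)) {
  matched <- ori.maf.barcode[startsWith(ori.maf.barcode, prefix)]
  new.barcodes <- c(new.barcodes, matched)
}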

This iterates through your prefixes and pulls the matching entries into a new vector.
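
The same prefix test can also be sketched without growing a vector inside a loop. This variant builds one logical vector per prefix with startsWith() and combines them with an element-wise OR, so each barcode is kept at most once and the original order is preserved. The names prefixes, hits and keep are made up for illustration.

```r
sub.same.barcodes <- c("TCGA-BQ-7058-01A", "TCGA-DZ-6131-02A", "TCGA-UZ-A9PZ-03A",
                       "TCGA-2Z-A9JQ-01A", "TCGA-BQ-5887-11A", "TCGA-2Z-A9JQ-01A")

ori.maf.barcode <- c("TCGA-BQ-7058-01A-11D-1963-05", "TCGA-DZ-6131-01A-11D-1963-05",
                     "TCGA-UZ-A9PZ-01A-11D-A42K-05", "TCGA-2Z-A9JQ-01A-11D-A42K-05",
                     "TCGA-BQ-5887-11A-01D-1963-05", "TCGA-G7-7502-01A-12D-A43K-06")

prefixes <- unique(sub.same.barcodes)

# One logical vector per prefix: which barcodes start with it?
hits <- lapply(prefixes, function(p) startsWith(ori.maf.barcode, p))

# Element-wise OR across prefixes, starting from all FALSE
keep <- Reduce(`|`, hits, logical(length(ori.maf.barcode)))

new.barcodes <- ori.maf.barcode[keep]
print(new.barcodes)
```

One caveat of any startsWith() approach is that a prefix also matches barcodes whose fourth field merely begins with the same characters, so comparing the extracted four-field prefix with %in% is stricter.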
