How would I create a subset by matching multiple patterns at a specific location in column names?

Sample code using dput:

df <- structure(
  list(
    TCGA.OR.A5JP.01A = c(0.0980697379293791, NA, NA, 0.883701102465278, 0.920634671107133),
    TCGA.OR.A5JG.01A = c(0.909142796219422, NA, NA, 0.870551482839855, 0.9170243029211),
    TCGA.PK.A5HB.01A = c(0.860316269591325, NA, NA, 0.283919878689488, 0.92350756003924),
    TCGA.OR.A5JE.01A = c(0.288860652773179, NA, NA, 0.831906751819423, 0.913890036560933),
    TCGA.OR.A5KU.01A = c(0.0897293436489091, NA, NA, 0.166760246036103, 0.920367435681197)
  ),
  row.names = c("cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"),
  class = "data.frame"
)

I want to create a subset that keeps only the columns whose names contain certain patterns at positions 11 and 12 (counting the "."s as characters), for example the "xx" in TCGA.OR.A5xx.01A. I have a list of multiple codes/patterns for that position (e.g., "JG", "HB", "KU").

I have tried:

df_subset <- subset(df, select=grepl("JG|HB|KU",names(df)))

but it is not position-specific, so columns that coincidentally contain those patterns elsewhere in their names are also included.

I also have a second question - can I somehow do this with a list of patterns? There are over 30 patterns I put in a list and I'm wondering if I could use that list instead of typing them all out again.

CodePudding user response：

We could use a combination of str_locate and which to select columns. If you have a vector of search terms, it can be collapsed into a single regular expression with paste0 (joining the terms with "|"). Then we can locate where that pattern matches in each column name and select the columns where the match spans positions 11 and 12.

library(tidyverse)

key_chr <- c("JG", "HB", "KU")
search_terms <- paste0(key_chr, collapse = "|")

# str_locate() returns the start and end of the first match in each name
loc <- str_locate(names(df), search_terms)

df %>% 
  select(which(loc[, "start"] == 11 & loc[, "end"] == 12))

Or in base R, we could write it as:

# drop = FALSE keeps the result a data frame even if only one column matches
df_subset <- df[, which(regexpr(search_terms, names(df)) == 11), drop = FALSE]

Output

           TCGA.OR.A5JG.01A TCGA.PK.A5HB.01A TCGA.OR.A5KU.01A
cg00000029        0.9091428        0.8603163       0.08972934
cg00000108               NA               NA               NA
cg00000109               NA               NA               NA
cg00000165        0.8705515        0.2839199       0.16676025
cg00000236        0.9170243        0.9235076       0.92036744
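
One caveat with both versions above: str_locate and regexpr report only the first match in each name, so a name that also contains one of the codes earlier on (say, in the second field) would be dropped even though positions 11 and 12 match. A minimal sketch of one way around this is to anchor the pattern so that it can only match at position 11. The last two column names below are invented purely to show the edge cases; they are not from the question's data.

```r
# Column names; the last two are hypothetical and only illustrate edge cases
nms <- c("TCGA.OR.A5JP.01A", "TCGA.OR.A5JG.01A", "TCGA.PK.A5HB.01A",
         "TCGA.HB.A5JG.01A",   # a code at positions 6-7 AND at 11-12
         "TCGA.KU.A5JP.01A")   # a code at positions 6-7 only
key_chr <- c("JG", "HB", "KU")

# "^.{10}" consumes exactly the first 10 characters, so the alternation
# that follows can only match at positions 11-12
anchored <- paste0("^.{10}(", paste(key_chr, collapse = "|"), ")")

keep_anchored <- grepl(anchored, nms)

# First-match approach, for comparison: misses "TCGA.HB.A5JG.01A"
keep_first <- regexpr(paste(key_chr, collapse = "|"), nms) == 11

nms[keep_anchored]
nms[keep_first]
```

With a data frame, the same logical vector could be used as df[, grepl(anchored, names(df)), drop = FALSE].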

CodePudding user response：

Another approach that does not rely on regex could be:

subset(df, select = substring(names(df), 11,12) %in% c("JG", "HB", "KU"))

##>            TCGA.OR.A5JG.01A TCGA.PK.A5HB.01A TCGA.OR.A5KU.01A
##> cg00000029        0.9091428        0.8603163       0.08972934
##> cg00000108               NA               NA               NA
##> cg00000109               NA               NA               NA
##> cg00000165        0.8705515        0.2839199       0.16676025
##> cg00000236        0.9170243        0.9235076       0.92036744
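
As for the second question, the same idea should work with codes that are already stored in a vector or list, so they do not have to be typed out again. The sketch below assumes the codes sit in a list called code_list (a hypothetical name) and uses a small stand-in data frame with made-up values rather than the real data.

```r
# Small stand-in for the data frame in the question (values are made up)
df <- data.frame(TCGA.OR.A5JP.01A = c(0.1, 0.9),
                 TCGA.OR.A5JG.01A = c(0.2, 0.8),
                 TCGA.PK.A5HB.01A = c(0.3, 0.7),
                 row.names = c("cg00000029", "cg00000165"))

# Hypothetical stored codes; a list works once flattened with unlist()
code_list <- list("JG", "HB", "KU")
codes <- unlist(code_list)

# Compare the two characters at positions 11-12 against the stored codes
keep <- substring(names(df), 11, 12) %in% codes

# drop = FALSE keeps a data frame even if only one column matches
df_subset <- df[, keep, drop = FALSE]
names(df_subset)
```

Because %in% does exact matching on the extracted two characters, no regex escaping is needed, however many codes the list holds.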