Sample code using dput:
df <- structure(list(
  TCGA.OR.A5JP.01A = c(0.0980697379293791, NA, NA, 0.883701102465278, 0.920634671107133),
  TCGA.OR.A5JG.01A = c(0.909142796219422, NA, NA, 0.870551482839855, 0.9170243029211),
  TCGA.PK.A5HB.01A = c(0.860316269591325, NA, NA, 0.283919878689488, 0.92350756003924),
  TCGA.OR.A5JE.01A = c(0.288860652773179, NA, NA, 0.831906751819423, 0.913890036560933),
  TCGA.OR.A5KU.01A = c(0.0897293436489091, NA, NA, 0.166760246036103, 0.920367435681197)),
  row.names = c("cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"),
  class = "data.frame")
I want to create a subset that keeps only the columns whose names contain certain patterns at character positions 11 and 12 (counting the "."s), for example the "xx" in TCGA.OR.A5xx.01A. I have a list of multiple codes/patterns for that position (e.g., "JG", "HB", "KU").
I have tried:
df_subset <- subset(df, select=grepl("JG|HB|KU",names(df)))
but it is not position-specific, so columns that coincidentally contain those patterns elsewhere in the name are also included.
I also have a second question - can I somehow do this with a list of patterns? There are over 30 patterns I put in a list and I'm wondering if I could use that list instead of typing them all out again.
CodePudding user response:
We could use a combination of str_locate and which to select columns. If you have a vector of search terms, they can be collapsed into a single regular expression with paste0. Then we can locate the search terms at particular positions (i.e., 11 and 12) and select those columns.
library(tidyverse)
key_chr <- c("JG", "HB", "KU")
search_terms <- paste0(key_chr, collapse = "|")
df %>%
select(which(str_locate(names(df), search_terms)[,1] == 11 & str_locate(names(df), search_terms)[,2] == 12))
Or in base R, we could write it as:
df <- df[, which(regexpr(search_terms, names(df)) == 11)]
Output
           TCGA.OR.A5JG.01A TCGA.PK.A5HB.01A TCGA.OR.A5KU.01A
cg00000029        0.9091428        0.8603163       0.08972934
cg00000108               NA               NA               NA
cg00000109               NA               NA               NA
cg00000165        0.8705515        0.2839199       0.16676025
cg00000236        0.9170243        0.9235076       0.92036744
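One caveat with the first-match approach: str_locate and regexpr report only the first match in each name, so a column whose name happens to contain one of the codes earlier (say at positions 6 and 7) would be dropped even if positions 11 and 12 also match. A possible way around this, sketched below, is to anchor the pattern so that it can only match at position 11. The two-column-name variations marked "hypothetical" are made up for illustration and are not from the question's data.

```r
# Small stand-in for the question's data frame (values are made up).
df <- data.frame(
  TCGA.OR.A5JP.01A = 1:2,
  TCGA.OR.A5JG.01A = 3:4,
  TCGA.HB.A5KU.01A = 5:6,  # hypothetical name: "HB" at 6-7 and "KU" at 11-12
  TCGA.KU.A5JE.01A = 7:8   # hypothetical name: "KU" at 6-7 only
)
key_chr <- c("JG", "HB", "KU")

# "^.{10}" consumes exactly ten characters from the start of the name,
# so the alternation group can only match starting at position 11.
anchored <- paste0("^.{10}(", paste(key_chr, collapse = "|"), ")")

keep <- grepl(anchored, names(df))
df_subset <- df[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
names(df_subset)
```

With these made-up names, the anchored pattern keeps TCGA.HB.A5KU.01A, whereas a first-match test would see "HB" at position 6 and skip it.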
CodePudding user response:
Another approach that does not rely on regular expressions could be:
subset(df, select = substring(names(df), 11,12) %in% c("JG", "HB", "KU"))
##>            TCGA.OR.A5JG.01A TCGA.PK.A5HB.01A TCGA.OR.A5KU.01A
##> cg00000029        0.9091428        0.8603163       0.08972934
##> cg00000108               NA               NA               NA
##> cg00000109               NA               NA               NA
##> cg00000165        0.8705515        0.2839199       0.16676025
##> cg00000236        0.9170243        0.9235076       0.92036744
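For the second question (reusing an existing collection of 30+ codes), the same idea should work with a stored object instead of typing the codes inline. The sketch below assumes the codes live in an R list; code_list is a hypothetical name, and the small data frame is a stand-in with made-up values.

```r
# Small stand-in for the question's data frame (values are made up).
df <- data.frame(
  TCGA.OR.A5JP.01A = 1:2,
  TCGA.OR.A5JG.01A = 3:4,
  TCGA.PK.A5HB.01A = 5:6
)

# Hypothetical list of codes; in practice it would hold all 30+ entries.
code_list <- list("JG", "HB", "KU")
codes <- unlist(code_list)  # flatten the list into a plain character vector

# Compare characters 11-12 of each column name against the codes.
keep <- substr(names(df), 11, 12) %in% codes
df_subset <- df[, keep, drop = FALSE]
names(df_subset)
```

If the codes are already a character vector, the unlist step is unnecessary and the vector can be passed to %in% directly.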