Filter columns using grepl to keep a particular string match

I have columns of counts labelled with different sample types. To begin the comparison, I would like to subset a particular group of samples.

To do that, I tried this:

df2 <- df1[,!grepl("pH|B|M|C|G|L", colnames(df1))]

My objective is to keep the samples whose names start with H or TCGA.

How do I do that? I understand that when I run the line above it also removes the TCGA-labelled columns, since the pattern contains letters (C and G) that also occur in "TCGA".

I tried it the other way:

df2 <- df1[,grepl("H|TCGA", colnames(df1))]

The issue here is that the samples labelled pH are also selected, because their names contain an H.

How do I resolve this?

Any help or suggestions would be appreciated.

names(df1)
  [1] "H1"           "H2"           "H3"           "H4"           "B11"          "B12"          "B1"           "B2"           "B3"          
 [10] "B4"           "B5"           "B6"           "B7"           "B8"           "B9"           "C1"           "C2"           "C3"          
 [19] "C4"           "G1"           "G2"           "G3"           "G4"           "L1"           "L2"           "L3"           "L4"          
 [28] "L5"           "L6"           "L7"           "L8"           "M1"           "M2"           "M3"           "M4"           "pH10"        
 [37] "pH11"         "pH12"         "pH1"          "pH2"          "pH3"          "pH4"          "pH5"          "pH6"          "pH7"         
 [46] "pH8"          "pH9"          "TCGA-AB-2856" "TCGA-AB-2849" "TCGA-AB-2971"

CodePudding user response：

Use ^ to anchor the match at the beginning of the string, so that H is only matched as the first character, as in

df2 <- df1[,grepl("^H|^TCGA", colnames(df1))]
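To see why the anchor matters, here is a minimal sketch on a small, made-up data frame that reuses a few of the column names from the question (the values are placeholders, not the real counts). It uses the equivalent grouped pattern "^(H|TCGA)" and adds drop = FALSE, which is an extra precaution not in the original answer: it keeps the result a data frame even if only one column matches.

```r
# Hypothetical toy data; check.names = FALSE keeps the hyphenated name intact
df1 <- data.frame(H1 = 1:2, B11 = 3:4, pH10 = 5:6,
                  `TCGA-AB-2856` = 7:8, check.names = FALSE)

# Unanchored: "H" may match anywhere in the name, so "pH10" is kept too
unanchored <- colnames(df1)[grepl("H|TCGA", colnames(df1))]

# Anchored: "^" applies to both alternatives inside the group
keep <- grepl("^(H|TCGA)", colnames(df1))
df2 <- df1[, keep, drop = FALSE]

print(unanchored)     # includes "pH10"
print(colnames(df2))  # only names starting with H or TCGA
```

The grouped form "^(H|TCGA)" and the original "^H|^TCGA" select the same columns; the group just avoids repeating the anchor.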

We can also use dplyr with starts_with():

library(dplyr)

df1 %>%
    select(starts_with(c('H', 'TCGA')))