Subset data to contain only columns whose names match multiple condition using data.table-CodePudding

This is based upon this question. The question said:

Is there a way for me to subset data based on column names starting with a particular string? I have some columns which are like ABC_1 ABC_2 ABC_3 and some like XYZ_1, XYZ_2,XYZ_3 let's say.

How can I subset my df based only on columns containing the above portions of text (lets say, ABC or XYZ)? I can use indices, but the columns are too scattered in data and it becomes too much of hard coding.

Also, I want to only include rows from each of these columns where any of their value is >0 so if either of the 6 columns above has a 1 in the row, it makes a cut into my final data frame.

One of the answers used dplyr to select multiple columns for the new data:

df <- df %>% dplyr:: select(grep("ABC", names(df)), grep("XYZ", names(df)))

However, I'd like a data.table solution similar to the above. I know that (as per one of the answers) that data.table can do one condition, but unsure how to do multiple:

df[, grep("ABC", names(df)), with = FALSE]

CodePudding user response：

Why not this?

set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,repl = TRUE),
                  ABC_2 = sample(0:1,3,repl = TRUE),
                  XYZ_1 = sample(0:1,3,repl = TRUE),
                  XYZ_2 = sample(0:1,3,repl = TRUE),
                  BCD_1 = sample(0:1, 3, replace = TRUE))
library(data.table)

df[, c(grep("ABC", names(df)), grep("XYZ", names(df)))]
#>   ABC_1 ABC_2 XYZ_1 XYZ_2
#> 1     0     0     0     1
#> 2     1     1     0     0
#> 3     0     0     1     0

^{Created on 2022-07-08 by the reprex package (v2.0.1)}

CodePudding user response：

Say you had a data.table with these many columns:

library(data.table)

DF <- matrix(data = rnorm(n = 50), nrow = 5, ncol = 10) |> 
  round(digits = 2) |> 
  as.data.table()

colnames(DF) <- paste0(letters[1:5], 1:10)
DF
#>       a1    b2    c3   d4    e5    a6    b7   c8    d9   e10
#> 1:  0.50  0.23  1.59 1.77  1.14  0.54  1.43 0.77 -1.22  1.25
#> 2:  0.16 -0.66  0.91 0.21 -2.96 -0.76  0.26 0.28 -0.59  0.23
#> 3: -1.60 -1.01 -0.22 1.14 -1.94 -0.65 -0.36 0.94  0.31  0.47
#> 4: -1.50  0.26 -0.54 0.60 -0.81 -0.41  0.30 0.52  0.10 -1.02
#> 5: -0.60 -1.64  1.33 0.31 -1.82 -0.33 -1.89 0.34 -0.68  0.72

To select only those columns starting with a certain prefix, say "a", you can use startsWith():

DF[, .SD, .SDcols = startsWith(x = colnames(DF), prefix = "a")]
#>       a1    a6
#> 1:  0.50  0.54
#> 2:  0.16 -0.76
#> 3: -1.60 -0.65
#> 4: -1.50 -0.41
#> 5: -0.60 -0.33

To select all those columns whose names contain certain strings, eg. either an "a" or a "b", you can use grep() with a pattern that matches either "a" or "b":

DF[, .SD, .SDcols = grep(pattern = "a|b", x = colnames(DF))]
#>       a1    b2    a6    b7
#> 1:  0.50  0.23  0.54  1.43
#> 2:  0.16 -0.66 -0.76  0.26
#> 3: -1.60 -1.01 -0.65 -0.36
#> 4: -1.50  0.26 -0.41  0.30
#> 5: -0.60 -1.64 -0.33 -1.89

CodePudding user response：

You can select multiple columns that match certain patterns in data.table using patterns in its .SDcols argument:

# turn df into data.table
setDT(df)

# select columns that start with ABD or XYZ
df[, .SD, .SDcols=patterns("ABC|XYZ")]

# or
df[, grep("ABC|XYZ", names(df)), with=FALSE]

subsetting rows and columns at the same time

cols = grep("ABC|XYZ",  names(df))

df[rowSums(df[, ..cols]>0)>0, .SD, .SDcols=cols]

CodePudding user response：

One-liner for data.table (using objectclosure's sample DF):

DF[, .SD[rowSums(.SD > 0) > 0,], .SDcols = patterns("^a")]
#       a1    a6
#    <num> <num>
# 1:  1.37 -0.70
# 2:  1.92  0.60
# 3:  0.41 -0.06
# 4: -1.43  1.28
# 5:  0.22 -0.89

For clarity since there are two > 0s in the code:

.SD > 0 is your filter of "value over 0"; and
rowSums(.) > 0 means one or more values on a row meets the above condition.

Translation, if you wanted rows where one or more values are greater than 3, then the inner code would be .SD[rowSums(.SD > 3) > 0,].