This is based upon this question. The question said:
Is there a way for me to subset data based on column names starting with a particular string? I have some columns which are like ABC_1 ABC_2 ABC_3 and some like XYZ_1, XYZ_2,XYZ_3 let's say.
How can I subset my df based only on columns containing the above portions of text (lets say, ABC or XYZ)? I can use indices, but the columns are too scattered in data and it becomes too much of hard coding.
Also, I want to only include rows from each of these columns where any of their value is >0 so if either of the 6 columns above has a 1 in the row, it makes a cut into my final data frame.
One of the answers used dplyr to select multiple columns for the new data:
df <- df %>% dplyr:: select(grep("ABC", names(df)), grep("XYZ", names(df)))
However, I'd like a data.table solution similar to the above. I know that (as per one of the answers) that data.table can do one condition, but unsure how to do multiple:
df[, grep("ABC", names(df)), with = FALSE]
CodePudding user response:
Why not this?
set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,repl = TRUE),
ABC_2 = sample(0:1,3,repl = TRUE),
XYZ_1 = sample(0:1,3,repl = TRUE),
XYZ_2 = sample(0:1,3,repl = TRUE),
BCD_1 = sample(0:1, 3, replace = TRUE))
library(data.table)
df[, c(grep("ABC", names(df)), grep("XYZ", names(df)))]
#> ABC_1 ABC_2 XYZ_1 XYZ_2
#> 1 0 0 0 1
#> 2 1 1 0 0
#> 3 0 0 1 0
Created on 2022-07-08 by the reprex package (v2.0.1)
CodePudding user response:
Say you had a data.table
with these many columns:
library(data.table)
DF <- matrix(data = rnorm(n = 50), nrow = 5, ncol = 10) |>
round(digits = 2) |>
as.data.table()
colnames(DF) <- paste0(letters[1:5], 1:10)
DF
#> a1 b2 c3 d4 e5 a6 b7 c8 d9 e10
#> 1: 0.50 0.23 1.59 1.77 1.14 0.54 1.43 0.77 -1.22 1.25
#> 2: 0.16 -0.66 0.91 0.21 -2.96 -0.76 0.26 0.28 -0.59 0.23
#> 3: -1.60 -1.01 -0.22 1.14 -1.94 -0.65 -0.36 0.94 0.31 0.47
#> 4: -1.50 0.26 -0.54 0.60 -0.81 -0.41 0.30 0.52 0.10 -1.02
#> 5: -0.60 -1.64 1.33 0.31 -1.82 -0.33 -1.89 0.34 -0.68 0.72
To select only those columns starting with a certain prefix, say "a", you can use startsWith()
:
DF[, .SD, .SDcols = startsWith(x = colnames(DF), prefix = "a")]
#> a1 a6
#> 1: 0.50 0.54
#> 2: 0.16 -0.76
#> 3: -1.60 -0.65
#> 4: -1.50 -0.41
#> 5: -0.60 -0.33
To select all those columns whose names contain certain strings, eg. either an "a" or a "b", you can use grep()
with a pattern that matches either "a" or "b":
DF[, .SD, .SDcols = grep(pattern = "a|b", x = colnames(DF))]
#> a1 b2 a6 b7
#> 1: 0.50 0.23 0.54 1.43
#> 2: 0.16 -0.66 -0.76 0.26
#> 3: -1.60 -1.01 -0.65 -0.36
#> 4: -1.50 0.26 -0.41 0.30
#> 5: -0.60 -1.64 -0.33 -1.89
CodePudding user response:
You can select multiple columns that match certain patterns in data.table using patterns
in its .SDcols
argument:
# turn df into data.table
setDT(df)
# select columns that start with ABD or XYZ
df[, .SD, .SDcols=patterns("ABC|XYZ")]
# or
df[, grep("ABC|XYZ", names(df)), with=FALSE]
subsetting rows and columns at the same time
cols = grep("ABC|XYZ", names(df))
df[rowSums(df[, ..cols]>0)>0, .SD, .SDcols=cols]
CodePudding user response:
One-liner for data.table
(using objectclosure's sample DF
):
DF[, .SD[rowSums(.SD > 0) > 0,], .SDcols = patterns("^a")]
# a1 a6
# <num> <num>
# 1: 1.37 -0.70
# 2: 1.92 0.60
# 3: 0.41 -0.06
# 4: -1.43 1.28
# 5: 0.22 -0.89
For clarity since there are two > 0
s in the code:
.SD > 0
is your filter of "value over 0"; androwSums(.) > 0
means one or more values on a row meets the above condition.
Translation, if you wanted rows where one or more values are greater than 3, then the inner code would be .SD[rowSums(.SD > 3) > 0,]
.