I am having a large data.table
(around 500 rows and 2.5 million columns). Columns are different features that can have 4 states (here "normal", "medium", "high", NA
). I want to count the appearances of these states for every feature. For this I wrote a script that is basically working in smaller dt
. However, in my full data.table
. It's running since 3 days now and still not finished. Any ideas how to get that faster?
# example code
library(data.table)
samples <- c("sample_one", "sample_two", "sample_three", "sample_four", "sample_five", "sample_six", "sample_seven", "sample_eight")
feature_one <- c("normal", "medium", "high", NA, "normal", NA, "high", NA)
feature_two <- c("medium", "medium", "medium", "medium", "high", "medium", "normal", NA)
feature_three <- c("normal", "normal", "high", NA, "normal", "medium", "medium", NA)
feature_four <- c("high", "medium", "normal", "medium", "normal", "medium", "high", "normal")
feature_five <- c("normal", "normal", "normal", NA, "normal", "medium", "medium", "medium")
feature_dt <- data.table(samples = samples,
feature_one = feature_one,
feature_two = feature_two,
feature_three = feature_three,
feature_four = feature_four,
feature_five = feature_five)
cols <- setdiff(names(feature_dt), "samples")
number_of_vars <- length(cols)
na_counts <- vector("list", number_of_vars)
names(na_counts) <- cols
normal_counts <- vector("list", number_of_vars)
names(normal_counts) <- cols
medium_counts <- vector("list", number_of_vars)
names(medium_counts) <- cols
high_counts <- vector("list", number_of_vars)
names(high_counts) <- cols
for (col in cols) {
eval(parse(text = paste0("na_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][is.na(", col, "), N]")))
eval(parse(text = paste0("normal_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"normal\", N]")))
eval(parse(text = paste0("medium_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"medium\", N]")))
eval(parse(text = paste0("high_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"high\", N]")))
}
CodePudding user response:
It is usually much easier to work on long data (many rows, few columns) than on wide data (many columns, few rows). Threfore, I would convert the data into longg format and run the aggregation on it
feature_dt[, samples := NULL]
dt_melted <- melt.data.table(feature_dt,
measure.vars = names(feature_dt),
variable.name = "FEATURE",
value.name = "VALUE")
dt_melted[, .N, keyby = .(FEATURE, VALUE)]
#> FEATURE VALUE N
#> 1: feature_one <NA> 3
#> 2: feature_one high 2
#> 3: feature_one medium 1
#> 4: feature_one normal 2
#> 5: feature_two <NA> 1
#> 6: feature_two high 1
#> 7: feature_two medium 5
#> 8: feature_two normal 1
#> 9: feature_three <NA> 2
#> 10: feature_three high 1
#> 11: feature_three medium 2
#> 12: feature_three normal 3
#> 13: feature_four high 2
#> 14: feature_four medium 3
#> 15: feature_four normal 3
#> 16: feature_five <NA> 1
#> 17: feature_five medium 3
#> 18: feature_five normal 4
CodePudding user response:
We can use table
on each column and bind into a single table:
out <- rbindlist(
# lapply(feature_dt[, .SD, .SDcols = -"samples"], \(z) as.data.table(table(state = z, useNA = "always"))),
# idcol = "feature")
out
# feature state N
# <char> <char> <int>
# 1: feature_one high 2
# 2: feature_one medium 1
# 3: feature_one normal 2
# 4: feature_one <NA> 3
# 5: feature_two high 1
# 6: feature_two medium 5
# 7: feature_two normal 1
# 8: feature_two <NA> 1
# 9: feature_three high 1
# 10: feature_three medium 2
# 11: feature_three normal 3
# 12: feature_three <NA> 2
# 13: feature_four high 2
# 14: feature_four medium 3
# 15: feature_four normal 3
# 16: feature_four <NA> 0
# 17: feature_five medium 3
# 18: feature_five normal 4
# 19: feature_five <NA> 1
If you then want it in a pivoted/reshaped format, we can certainly do one of the following, depending on your preferences:
dcast(state ~ feature, data = out, value.var = "N", fill = 0L)
# state feature_five feature_four feature_one feature_three feature_two
# <char> <int> <int> <int> <int> <int>
# 1: <NA> 1 0 3 2 1
# 2: high 0 2 2 1 1
# 3: medium 3 3 1 2 5
# 4: normal 4 3 2 3 1
dcast(feature ~ state, data = out, value.var = "N", fill = 0L)
# feature NA high medium normal
# <char> <int> <int> <int> <int>
# 1: feature_five 1 0 3 4
# 2: feature_four 0 2 3 3
# 3: feature_one 3 2 1 2
# 4: feature_three 2 1 2 3
# 5: feature_two 1 1 5 1
Note that the column name in the last expression is "NA"
, not NA
; as such, in follow-on processing you'll need to quote or backtick it (instead of trying to refer to it as the symbol NA
).