fast counting of appearences in R data.table-CodePudding

I am having a large data.table (around 500 rows and 2.5 million columns). Columns are different features that can have 4 states (here "normal", "medium", "high", NA). I want to count the appearances of these states for every feature. For this I wrote a script that is basically working in smaller dt. However, in my full data.table. It's running since 3 days now and still not finished. Any ideas how to get that faster?

# example code

library(data.table)

samples <- c("sample_one", "sample_two", "sample_three", "sample_four", "sample_five", "sample_six", "sample_seven", "sample_eight")
feature_one <- c("normal", "medium", "high", NA, "normal", NA, "high", NA)
feature_two <- c("medium", "medium", "medium", "medium", "high", "medium", "normal", NA)
feature_three <- c("normal", "normal", "high", NA, "normal", "medium", "medium", NA)
feature_four <- c("high", "medium", "normal", "medium", "normal", "medium", "high", "normal")
feature_five <- c("normal", "normal", "normal", NA, "normal", "medium", "medium", "medium")

feature_dt <- data.table(samples = samples,
                         feature_one = feature_one,
                         feature_two = feature_two,
                         feature_three = feature_three,
                         feature_four = feature_four,
                         feature_five = feature_five)

cols <- setdiff(names(feature_dt), "samples")

number_of_vars <- length(cols)

na_counts <- vector("list", number_of_vars)
names(na_counts) <- cols

normal_counts <- vector("list", number_of_vars)
names(normal_counts) <- cols

medium_counts <- vector("list", number_of_vars)
names(medium_counts) <- cols

high_counts <- vector("list", number_of_vars)
names(high_counts) <- cols

for (col in cols) {
  eval(parse(text = paste0("na_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][is.na(", col, "), N]")))
  eval(parse(text = paste0("normal_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"normal\", N]")))
  eval(parse(text = paste0("medium_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"medium\", N]")))
  eval(parse(text = paste0("high_counts[[\"", col, "\"]] <- feature_dt[, .N, by = ", col, "][", col, " == \"high\", N]")))
}

CodePudding user response：

It is usually much easier to work on long data (many rows, few columns) than on wide data (many columns, few rows). Threfore, I would convert the data into longg format and run the aggregation on it

feature_dt[, samples := NULL]
dt_melted <- melt.data.table(feature_dt,
                             measure.vars = names(feature_dt),
                             variable.name = "FEATURE",
                             value.name = "VALUE")
dt_melted[, .N, keyby = .(FEATURE, VALUE)]
#>           FEATURE  VALUE N
#>  1:   feature_one   <NA> 3
#>  2:   feature_one   high 2
#>  3:   feature_one medium 1
#>  4:   feature_one normal 2
#>  5:   feature_two   <NA> 1
#>  6:   feature_two   high 1
#>  7:   feature_two medium 5
#>  8:   feature_two normal 1
#>  9: feature_three   <NA> 2
#> 10: feature_three   high 1
#> 11: feature_three medium 2
#> 12: feature_three normal 3
#> 13:  feature_four   high 2
#> 14:  feature_four medium 3
#> 15:  feature_four normal 3
#> 16:  feature_five   <NA> 1
#> 17:  feature_five medium 3
#> 18:  feature_five normal 4

CodePudding user response：

We can use table on each column and bind into a single table:

out <- rbindlist(
#   lapply(feature_dt[, .SD, .SDcols = -"samples"], \(z) as.data.table(table(state = z, useNA = "always"))), 
#   idcol = "feature")
out
#           feature  state     N
#            <char> <char> <int>
#  1:   feature_one   high     2
#  2:   feature_one medium     1
#  3:   feature_one normal     2
#  4:   feature_one   <NA>     3
#  5:   feature_two   high     1
#  6:   feature_two medium     5
#  7:   feature_two normal     1
#  8:   feature_two   <NA>     1
#  9: feature_three   high     1
# 10: feature_three medium     2
# 11: feature_three normal     3
# 12: feature_three   <NA>     2
# 13:  feature_four   high     2
# 14:  feature_four medium     3
# 15:  feature_four normal     3
# 16:  feature_four   <NA>     0
# 17:  feature_five medium     3
# 18:  feature_five normal     4
# 19:  feature_five   <NA>     1

If you then want it in a pivoted/reshaped format, we can certainly do one of the following, depending on your preferences:

dcast(state ~ feature, data = out, value.var = "N", fill = 0L)
#     state feature_five feature_four feature_one feature_three feature_two
#    <char>        <int>        <int>       <int>         <int>       <int>
# 1:   <NA>            1            0           3             2           1
# 2:   high            0            2           2             1           1
# 3: medium            3            3           1             2           5
# 4: normal            4            3           2             3           1

dcast(feature ~ state, data = out, value.var = "N", fill = 0L)
#          feature    NA  high medium normal
#           <char> <int> <int>  <int>  <int>
# 1:  feature_five     1     0      3      4
# 2:  feature_four     0     2      3      3
# 3:   feature_one     3     2      1      2
# 4: feature_three     2     1      2      3
# 5:   feature_two     1     1      5      1

Note that the column name in the last expression is "NA", not NA; as such, in follow-on processing you'll need to quote or backtick it (instead of trying to refer to it as the symbol NA).