How to use data.table to apply function to huge dataset (390 rows and 280976 cols)?-CodePudding

I am trying to apply a function to a huge dataset and apply is very slow. Instead of log2 as in the image I want to take the log10. a is 1/10 of the smallest value in my dataset, which I have pre-defined.

  a <- 0.000876
  for(j in seq_len(ncol(dat2))) {
  set(dat2, j=j, value=log10((j   sqrt(j^2   a^2))/2))
  }

When I use the above code on a smaller subset of the data, I get repeated values for each column, which is incorrect (e.g col1 is all 0.176, col2 is 0.477 etc). I also noticed my other earlier data frames before creating the data table get changed as well, which seems odd (dat and dat1 are also changed).

  dat<- read.csv("...csv")
  dat1<- dat[,-c(1:2)]
  dat2<- setDT(dat1)

Using the apply function on this smaller subset works fine, but is far too slow for the full dataset.

    log_out <- apply(need, 2, function(x)
       log10((x   sqrt(x^2   a^2))/2))

Any help would be appreciated please.

CodePudding user response：

First, your for loop is using the literal value of j in the calculations, not the contents of the jth column. For that, you'd need:

for(j in seq_len(ncol(dat))) {
  set(dat, j=j, value=log10((dat[[j]]   sqrt(dat[[j]]^2   a^2))/2))
}
dat
#           V1        V2        V3       V4       V5       V6       V7       V8       V9      V10
#        <num>     <num>     <num>    <num>    <num>    <num>    <num>    <num>    <num>    <num>
# 1: 0.3521825 0.7084274 0.9068577 1.043398 1.147369 1.231292 1.301639 1.362189 1.415334 1.462688
# 2: 0.5020242 0.7847834 0.9572275 1.080868 1.177173 1.256025 1.322772 1.380635 1.431698 1.477392
# 3: 0.6165806 0.8499998 1.0024225 1.115382 1.205071 1.279429 1.342927 1.398330 1.447469 1.491616

This matches your apply,

# dat, redefined
apply(dat, 2, function(x) log10((x   sqrt(x^2   a^2))/2))
#             V1        V2        V3       V4       V5       V6       V7       V8       V9      V10
# [1,] 0.3521825 0.7084274 0.9068577 1.043398 1.147369 1.231292 1.301639 1.362189 1.415334 1.462688
# [2,] 0.5020242 0.7847834 0.9572275 1.080868 1.177173 1.256025 1.322772 1.380635 1.431698 1.477392
# [3,] 0.6165806 0.8499998 1.0024225 1.115382 1.205071 1.279429 1.342927 1.398330 1.447469 1.491616

However, I think you can do this in a more data.table-canonical method this way:

dat[, log10((.SD   sqrt(.SD^2   a^2))/2)]
#           V1        V2        V3       V4       V5       V6       V7       V8       V9      V10
#        <num>     <num>     <num>    <num>    <num>    <num>    <num>    <num>    <num>    <num>
# 1: 0.3521825 0.7084274 0.9068577 1.043398 1.147369 1.231292 1.301639 1.362189 1.415334 1.462688
# 2: 0.5020242 0.7847834 0.9572275 1.080868 1.177173 1.256025 1.322772 1.380635 1.431698 1.477392
# 3: 0.6165806 0.8499998 1.0024225 1.115382 1.205071 1.279429 1.342927 1.398330 1.447469 1.491616

(which you'd need to reassign back on itself or to a new variable/object). The only "issue" with this method is that you'll have this new object and the old object in memory at the same time (even with 390 rows, 280k columns is not onerous); to do it more in-place, the for loop may be better, though don't underestimate readability, maintainability, and simple.

My sample data:

dat <- data.table(matrix(2:31, nrow = 3))
a <- 1.5 # less than the min of dat

CodePudding user response：

When you want a calculation on every value in your table, convert it to a matrix and take advantage of the fact that you can do any calculation on the matrix itself.

Using the same data as r2evans

m <- matrix(2:31, nrow = 3)
a <- 1.5

Then simply do your math like this

log10((m   sqrt(m^2   a^2))/2)

Results

        [,1]    [,2]    [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
[1,] 0.35218 0.70843 0.90686 1.0434 1.1474 1.2313 1.3016 1.3622 1.4153 1.4627
[2,] 0.50202 0.78478 0.95723 1.0809 1.1772 1.2560 1.3228 1.3806 1.4317 1.4774
[3,] 0.61658 0.85000 1.00242 1.1154 1.2051 1.2794 1.3429 1.3983 1.4475 1.4916

If you want to calculate a, you can do with a <- min(m) / 10 or replace a in the formula and use log10((m sqrt(m^2 (min(m) / 10)^2))/2)