Home > Software design >  R Fastest way to get value from a data.table under multiple criteria
R Fastest way to get value from a data.table under multiple criteria

Time:04-30

i am trying to get the fastest (and to some extent elegant) method to extract single elements out of a data.table under a few criteria (with support table).

For simplicity a significantly shortend example:

library(data.table)

dt <- data.table(
      person   = c("Rick", "Michelle", "Richard", "Ryan", "Larry"),
      criteria = c("A", "B", "C", "A", "C"),
      number   = c(5, 62, 25, 77, 91),
      gender   = c("M", "F", "M", "M", "M")
    )

supp.dt <- data.table(
  crits = c("A", "B", "C"),
  ID    = c("ID.1", "ID.2", "ID.3")
)

value.dt <- data.table(
  ID.1M = runif(100, 0, 1),
  ID.1F = runif(100, 0, 1),
  ID.2M = runif(100, 1, 2),
  ID.2F = runif(100, 1, 2),
  ID.3M = runif(100, 2, 3),
  ID.3F = runif(100, 2, 3)
)

getValue <- function(crit=NULL, numb=NULL, gend=NULL) {
  return(value.dt[numb, paste0(supp.dt[crits == crit]$ID, gend), with = F])
}

dt$value <- mapply(getValue, dt$criteria, dt$number, dt$gender)

Benchmark and results:

library(microbenchmark)

runmapply <- function() {
  dt$value <- mapply(getValue, dt$criteria, dt$number, dt$gender)
}

microbenchmark(
  runmapply()
)

# Unit: milliseconds
# expr        min    lq    mean     median uq     max       neval
# runmapply() 5.5528 5.979 7.912311 6.4392 8.4442 24.1152   100

The method here seems to be alright, but

  • my data has up to million rows
  • I need to extract for more than one value, in fact I working with 13 different values; I stored them as a list of data.tables in the first place and

In summary it takes too long for the amount of data and the different values.

Thanks in advance for any ideas for optimization.

UPDATE:

I thought about shaping and joining tables. But had no idea how to approach. I guess I didn't know what data.table was capable of. Thanks @langtang

To be more specific about those 13 tables I mentioned: They have all data.tables with 100 rows but with different amount of columns. I should also mention the names of the columns vary, so the column names ID.1 to ID.3 in the example are set wrongly. But this can easily solved with ID=str_sub(ID, 1, length(ID)).

With more value tables, we can clearly see, that the criteria gives us the connection to the supp.dt, which on the other hand held all the information to the 13 data.tables.

So we could look at the example with two value.dt's and I also change those column names a bit:

library(data.table)

dt <- data.table(
      person   = c("Rick", "Michelle", "Richard", "Ryan", "Larry"),
      criteria = c("A", "B", "C", "A", "C"),
      number   = c(5, 62, 25, 77, 91),
      gender   = c("M", "F", "M", "M", "M")
    )

supp.dt <- data.table(
  crits         = c("A", "B", "C"),
  valueoneID    = c("uno", "dos", "tres"),
  valuetwoID    = c("eins", "zwei", "drei")
) # for every valuetable there is a respective id column

valueone.dt <- data.table(
  unoM  = runif(100, 0, 1),
  unoF  = runif(100, 0, 1),
  dosM  = runif(100, 1, 2),
  dosF  = runif(100, 1, 2),
  tresM = runif(100, 2, 3),
  tresF = runif(100, 2, 3)
)

valuetwo.dt <- data.table(
  einsM = runif(100, 0, 1),
  einsF = runif(100, 0, 1),
  zweiM = runif(100, 1, 2),
  zweiF = runif(100, 1, 2),
  dreiM = runif(100, 2, 3),
  dreiF = runif(100, 2, 3)
)

My expected output should be like: (the ID column is not necessary)

   gender   person   number  valueone  valuetwo 
   <char>   <char>   <int>   <num>     <num>
1: M        Rick     5       0.8572478 0.2312414      
2: M        Ryan     77      0.6211473 0.8585884
3: F        Michelle 62      1.8570321 1.2232323
4: M        Richard  25      2.5732931 2.2323179
5: M        Larry    91      2.0300149 2.0919987

CodePudding user response:

I think the approach will depend on the exact structure of your multiple values (13, and where these are stored), but you could think about using joins, etc.

melt(value.dt[, number:=.I],id.vars = "number", variable.name = "ID")[
  ,`:=`(ID=str_sub(ID, 1,4), gender=str_sub(ID,-1,-1))][
    dt[supp.dt,on=c("criteria"="crits")],
    on=.(ID,gender,number)]

Output:

   number     ID     value gender   person criteria
    <int> <char>     <num> <char>   <char>   <char>
1:      5   ID.1 0.8572478      M     Rick        A
2:     77   ID.1 0.6211473      M     Ryan        A
3:     62   ID.2 1.8570321      F Michelle        B
4:     25   ID.3 2.5732931      M  Richard        C
5:     91   ID.3 2.0300149      M    Larry        C

Update

If you have a number of "value" frames, you can take the following approach:

First, put them in a named list

value_frames = list("valueoneID" = valueone.dt,"valuetwoID" = valuetwo.dt)

Second, create a single value.dt data.table that row-binds melted versions of these different frames from value_frames

value.dt = rbindlist(
  lapply(value_frames, \(f) melt(f[,number:=.I], id.vars="number", variable.name="ID")),
  idcol = "vsrc"
)

Then, join value.dt to a melted version of a join between dt and supp.dt, and dcast the result back to wide.

dcast(
  value.dt[,`:=`(ID = str_sub(ID,1,-2), gender=str_sub(ID,-1,-1))][
    melt(dt[supp.dt,on=c("criteria"="crits")],
        measure.vars = patterns("value"),
        variable.name = "vsrc",
        value.name = "ID"),
    on=.(ID,gender,number,vsrc)],
  gender person number~vsrc, value.var="value"
)
  • Related