i am trying to get the fastest (and to some extent elegant) method to extract single elements out of a data.table under a few criteria (with support table).
For simplicity a significantly shortend example:
library(data.table)
dt <- data.table(
person = c("Rick", "Michelle", "Richard", "Ryan", "Larry"),
criteria = c("A", "B", "C", "A", "C"),
number = c(5, 62, 25, 77, 91),
gender = c("M", "F", "M", "M", "M")
)
supp.dt <- data.table(
crits = c("A", "B", "C"),
ID = c("ID.1", "ID.2", "ID.3")
)
value.dt <- data.table(
ID.1M = runif(100, 0, 1),
ID.1F = runif(100, 0, 1),
ID.2M = runif(100, 1, 2),
ID.2F = runif(100, 1, 2),
ID.3M = runif(100, 2, 3),
ID.3F = runif(100, 2, 3)
)
getValue <- function(crit=NULL, numb=NULL, gend=NULL) {
return(value.dt[numb, paste0(supp.dt[crits == crit]$ID, gend), with = F])
}
dt$value <- mapply(getValue, dt$criteria, dt$number, dt$gender)
Benchmark and results:
library(microbenchmark)
runmapply <- function() {
dt$value <- mapply(getValue, dt$criteria, dt$number, dt$gender)
}
microbenchmark(
runmapply()
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# runmapply() 5.5528 5.979 7.912311 6.4392 8.4442 24.1152 100
The method here seems to be alright, but
- my data has up to million rows
- I need to extract for more than one value, in fact I working with 13 different values; I stored them as a list of data.tables in the first place and
In summary it takes too long for the amount of data and the different values.
Thanks in advance for any ideas for optimization.
UPDATE:
I thought about shaping and joining tables. But had no idea how to approach. I guess I didn't know what data.table was capable of. Thanks @langtang
To be more specific about those 13 tables I mentioned: They have all data.tables with 100 rows but with different amount of columns. I should also mention the names of the columns vary, so the column names ID.1 to ID.3 in the example are set wrongly. But this can easily solved with ID=str_sub(ID, 1, length(ID)).
With more value tables, we can clearly see, that the criteria gives us the connection to the supp.dt, which on the other hand held all the information to the 13 data.tables.
So we could look at the example with two value.dt's and I also change those column names a bit:
library(data.table)
dt <- data.table(
person = c("Rick", "Michelle", "Richard", "Ryan", "Larry"),
criteria = c("A", "B", "C", "A", "C"),
number = c(5, 62, 25, 77, 91),
gender = c("M", "F", "M", "M", "M")
)
supp.dt <- data.table(
crits = c("A", "B", "C"),
valueoneID = c("uno", "dos", "tres"),
valuetwoID = c("eins", "zwei", "drei")
) # for every valuetable there is a respective id column
valueone.dt <- data.table(
unoM = runif(100, 0, 1),
unoF = runif(100, 0, 1),
dosM = runif(100, 1, 2),
dosF = runif(100, 1, 2),
tresM = runif(100, 2, 3),
tresF = runif(100, 2, 3)
)
valuetwo.dt <- data.table(
einsM = runif(100, 0, 1),
einsF = runif(100, 0, 1),
zweiM = runif(100, 1, 2),
zweiF = runif(100, 1, 2),
dreiM = runif(100, 2, 3),
dreiF = runif(100, 2, 3)
)
My expected output should be like: (the ID column is not necessary)
gender person number valueone valuetwo
<char> <char> <int> <num> <num>
1: M Rick 5 0.8572478 0.2312414
2: M Ryan 77 0.6211473 0.8585884
3: F Michelle 62 1.8570321 1.2232323
4: M Richard 25 2.5732931 2.2323179
5: M Larry 91 2.0300149 2.0919987
CodePudding user response:
I think the approach will depend on the exact structure of your multiple values (13, and where these are stored), but you could think about using joins, etc.
melt(value.dt[, number:=.I],id.vars = "number", variable.name = "ID")[
,`:=`(ID=str_sub(ID, 1,4), gender=str_sub(ID,-1,-1))][
dt[supp.dt,on=c("criteria"="crits")],
on=.(ID,gender,number)]
Output:
number ID value gender person criteria
<int> <char> <num> <char> <char> <char>
1: 5 ID.1 0.8572478 M Rick A
2: 77 ID.1 0.6211473 M Ryan A
3: 62 ID.2 1.8570321 F Michelle B
4: 25 ID.3 2.5732931 M Richard C
5: 91 ID.3 2.0300149 M Larry C
Update
If you have a number of "value" frames, you can take the following approach:
First, put them in a named list
value_frames = list("valueoneID" = valueone.dt,"valuetwoID" = valuetwo.dt)
Second, create a single value.dt
data.table that row-binds melted versions of these different frames from value_frames
value.dt = rbindlist(
lapply(value_frames, \(f) melt(f[,number:=.I], id.vars="number", variable.name="ID")),
idcol = "vsrc"
)
Then, join value.dt
to a melted version of a join between dt
and supp.dt
, and dcast the result back to wide.
dcast(
value.dt[,`:=`(ID = str_sub(ID,1,-2), gender=str_sub(ID,-1,-1))][
melt(dt[supp.dt,on=c("criteria"="crits")],
measure.vars = patterns("value"),
variable.name = "vsrc",
value.name = "ID"),
on=.(ID,gender,number,vsrc)],
gender person number~vsrc, value.var="value"
)