Home > Net >  Create new data.table column based on last(name) of named vector
Create new data.table column based on last(name) of named vector

Time:03-31

I have huge dataset (over 100 Mio rows) with traffic data and want to create an additional column that says in which quantile the traffic of this row lies.


MWE

library(data.table)
# Demo dataset
test = data.table(a=c(1,2,3,4,5,6,7,8,9))
# Demo numeric vector
tocheck = c("TypeA" = 1, "TypeB" = 5)

Expected result

It should return the last name of the vector according to the value in column a:

> last(names(tocheck)[tocheck < 2])
[1] "TypeA"
> last(names(tocheck)[tocheck < 7])
[1] "TypeB"

The final dataset should looks like:

data.table(a=c(1,2,3,4,5,6,7,8,9), 
       Check=c("TypeA","TypeA","TypeA","TypeA","TypeA",
               "TypeB","TypeB","TypeB","TypeB"))

|  a|Check |
|--:|:-----|
|  1|TypeA |
|  2|TypeA |
|  3|TypeA |
|  4|TypeA |
|  5|TypeA |
|  6|TypeB |
|  7|TypeB |
|  8|TypeB |
|  9|TypeB |

What I've tried

> test[, Check := last(names(tocheck)[tocheck < a])]

But this gives the following warning and no result:

Warning message:
In tocheck < a :
  longer object length is not a multiple of shorter object length

And with google I haven't found a solution that I could adapt.


Question

How can I solve this with main focus on speed/performance (>100 Mio rows, 6GB of data)?

Thank you for any advice and suggestion.

CodePudding user response:

We can use findInterval

library(data.table)
test[, Check := names(tocheck)[findInterval(a, tocheck, rightmost.closed = TRUE)]]

Or with data.table, this can be done using a non-equi join as well after converting the named vector to a data.frame/data.table (stack converts to two column data.frame)

test[stack(tocheck), Check := ind, on = .(a > values)]
  • Related