I have huge dataset (over 100 Mio rows) with traffic data and want to create an additional column that says in which quantile the traffic of this row lies.
MWE
library(data.table)
# Demo dataset
test = data.table(a=c(1,2,3,4,5,6,7,8,9))
# Demo numeric vector
tocheck = c("TypeA" = 1, "TypeB" = 5)
Expected result
It should return the last name of the vector according to the value in column a
:
> last(names(tocheck)[tocheck < 2])
[1] "TypeA"
> last(names(tocheck)[tocheck < 7])
[1] "TypeB"
The final dataset should looks like:
data.table(a=c(1,2,3,4,5,6,7,8,9),
Check=c("TypeA","TypeA","TypeA","TypeA","TypeA",
"TypeB","TypeB","TypeB","TypeB"))
| a|Check |
|--:|:-----|
| 1|TypeA |
| 2|TypeA |
| 3|TypeA |
| 4|TypeA |
| 5|TypeA |
| 6|TypeB |
| 7|TypeB |
| 8|TypeB |
| 9|TypeB |
What I've tried
> test[, Check := last(names(tocheck)[tocheck < a])]
But this gives the following warning and no result:
Warning message:
In tocheck < a :
longer object length is not a multiple of shorter object length
And with google I haven't found a solution that I could adapt.
Question
How can I solve this with main focus on speed/performance (>100 Mio rows, 6GB of data)?
Thank you for any advice and suggestion.
CodePudding user response:
We can use findInterval
library(data.table)
test[, Check := names(tocheck)[findInterval(a, tocheck, rightmost.closed = TRUE)]]
Or with data.table
, this can be done using a non-equi join as well after converting the named vector
to a data.frame/data.table (stack
converts to two column data.frame)
test[stack(tocheck), Check := ind, on = .(a > values)]