I have been running into a computation time issue when working with large datasets and applying functions that use multiple columns of the data.table as inputs. Here is a simplified example:
require(data.table)
main_dt <- data.table(Start=(1:2), End=c(2,2))
lookup_dt <- data.table(Year = 1:3, Amount = c(10,20,30))
So:
> main_dt
Start End
1: 1 2
2: 2 2
And
> lookup_dt
Year Amount
1: 1 10
2: 2 20
3: 3 30
What I want is to add a column in data.table main_dt that includes the sum of Amounts of Years between Start and End. So something like this, but there is a referencing error.
main_dt[, Amount := lookup_dt[ Year >= start & Year <= end, sum(Amount)]]
Warning messages:
1: In Year >= Start : longer object length is not a multiple of shorter object length
2: In Year <= End : longer object length is not a multiple of shorter object length
If the Amount would depend only on one variable, I could do:
main_dt[, GreaterAmount := lapply(Start, function(x) return (lookup_dt[Year >= x, sum(Amount)]))]
to get:
> main_dt
Start End Amount GreaterAmount
1: 1 2 30 60
2: 2 2 30 50
Which would be ok, but
- I need to apply a function that actually depends on both Start and End
- Applying a function for every row individually slows down computation process rapidly. A vectorized solution would be highly appreciated.
Looking forward to any suggestions!
Thank you! Markus
CodePudding user response:
First, the join columns should be the same class, so we can either convert main_dt$End
to integer, or main_df$Start
and lookup_dt$Year
to numeric. I'll choose the first:
main_dt[, End := as.integer(End)]
main_dt
# Start End
# <int> <int>
# 1: 1 2
# 2: 2 2
From here, we can do a joining-assignment:
main_dt[, Amount := lookup_dt[.SD, sum(Amount), on = .(Year >= Start, Year <= End), by = .EACHI]$V1 ]
main_dt
# Start End Amount
# <int> <int> <num>
# 1: 1 2 30
# 2: 2 2 20
If you're somewhat familiar with data.table
, note that .SD
referenced is actually the contents of main_dt
, so lookup_dt[.SD,...]
is effectively "main_dt left join lookup_dt". From there, the on=
should be normal, and sum(Amount)
is what you want to aggregate. The only new thing introduced here is the use of by=.EACHI
, which can be confusing; some links for that: