I'm having a data.table:
library(data.table)
dt = structure(list(date = c("01.01.2020", "01.01.2020", "02.01.2020",
"02.01.2020", "03.01.2020", "03.01.2020", "04.01.2020", "04.01.2020"
), name = c("10AFC25D", "FA1A310C", "10AFC25D", "FA1A310C", "10AFC25D",
"FA1A310C", "10AFC25D", "FA1A310C"), value = c(100L, 50L, 80L,
60L, 70L, 60L, 50L, 80L)), row.names = c(NA, -8L), class = c("data.table", "data.frame"))
dt[, date:=as.Date(date, format="%d.%m.%Y")]
That looks like:
> dt
date name value
1: 01.01.2020 10AFC25D 100
2: 01.01.2020 FA1A310C 50
3: 02.01.2020 10AFC25D 80
4: 02.01.2020 FA1A310C 60
5: 03.01.2020 10AFC25D 70
6: 03.01.2020 FA1A310C 60
7: 04.01.2020 10AFC25D 50
8: 04.01.2020 FA1A310C 80
Goal: I want to calculate two new columns which gives the difference to last day value. One column shows the absolute difference and the other the relative difference. And the formular should be that flexibel that I could change the lag from 1 day to 7 days (if I want to compare the same weekdays) - or any other value.
The expected output should look like:
date name value diff_absolut diff_relative
1: 01.01.2020 10AFC25D 100 NA NA
2: 01.01.2020 FA1A310C 50 NA NA
3: 02.01.2020 10AFC25D 80 -20 -0.2000000
4: 02.01.2020 FA1A310C 60 10 0.2000000
5: 03.01.2020 10AFC25D 70 -10 -0.1250000
6: 03.01.2020 FA1A310C 60 0 0.0000000
7: 04.01.2020 10AFC25D 50 -20 -0.2857143
8: 04.01.2020 FA1A310C 80 20 0.3333333
I could solve it like:
dt2 = copy(dt)
dt2[, date:=date days(1)]
dt_final = merge(dt, dt2, by=c("date", "name"), all.x=TRUE, suffixes=c("", "_2"))
dt_final[, `:=`(diff_absolute=value-value_2, diff_relative=(value-value_2)/value_2, value_2=NULL)]
dt_final
date name value diff_absolute diff_relative
1: 2020-01-01 10AFC25D 100 NA NA
2: 2020-01-01 FA1A310C 50 NA NA
3: 2020-01-02 10AFC25D 80 -20 -0.2000000
4: 2020-01-02 FA1A310C 60 10 0.2000000
5: 2020-01-03 10AFC25D 70 -10 -0.1250000
6: 2020-01-03 FA1A310C 60 0 0.0000000
7: 2020-01-04 10AFC25D 50 -20 -0.2857143
8: 2020-01-04 FA1A310C 80 20 0.3333333
This works correctly, but it doesn't look really elegant and efficient. And as the original data has between 1 and 24 Mio rows I thought I better ask if someone has a smoother solution? Please data.table only. Thank you very much.
CodePudding user response:
If you think of this in terms of rows this should do:
lag = 2L
dt[, diff_absolut := shift(value, n = lag) - value]
dt[, diff_relative := diff_absolut / shift(value, n = lag)]
CodePudding user response:
like this?
for longer lag, set the n-argument in the shift
-functions
dt[, `:=`(diff_absolute = value - shift(value),
diff_relative = (value - shift(value)) / shift(value)),
by = .(name)][]