I am new to the data.table package, and am semi-new to R, and would like to use data.table because of its speed in working with very large data sets.
Suppose we start with this dataframe called "data
", generated by the code beneath it:
> data
ID Period_1 Period_2 Values State
1: 1 1 2020-01 5 X0
2: 1 2 2020-02 10 X1
3: 1 3 2020-03 15 X2
4: 2 1 2020-04 0 X0
5: 2 2 2020-05 2 X2
6: 2 3 2020-06 4 X0
7: 3 1 2020-02 3 X2
8: 3 2 2020-03 6 X1
9: 3 3 2020-04 9 X0
data <-
data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period_1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Period_2 = c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06","2020-02","2020-03","2020-04"),
Values = c(5, 10, 15, 0, 2, 4, 3, 6, 9),
State = c("X0","X1","X2","X0","X2","X0", "X2","X1","X0")
)
I would like to create a new dataframe showing all the "states" flowing into a user-specified target state (call it "X") over time, as measured by Period_1 in this data
dataframe. For Period_1 = 1, we simply count all instances of target state X. For all Period_1 > 1, for any row with state = X, all instances of X are counted and placed into the corresponding row of the dataframe reflecting the state in the immediately prior Period_1 (for the same ID). How could this be efficiently done using data.table?
The image below illustrates this better, where the newly derived dataframe shows all states flowing into target state x0 over time:
I include in data
other columns (Period_2 and Values) for use as this function later evolves, for alternatively defining the time horizon as Period_2 and for summing the flow of Values rather than counting in instances of state. I should be able to do those on my own after someone gives me a kick start with the request above.
CodePudding user response:
Here is one option:
f <- function(s) {
dcast(
rbind(unique(data[,.(State,Period_1,N=0)]),
data[, priorState:=lag(State), by = ID] %>%
.[State==s] %>%
.[!is.na(priorState), State:=priorState] %>%
.[, .N, .(State,Period_1)]
),
State~Period_1, value.var="N", fun.aggregate=sum
)
}
setDT(data)
f("X0")
Output:
State 1 2 3
<char> <num> <num> <num>
1: X0 2 0 0
2: X1 0 0 1
3: X2 0 0 1