I am using the data.table package to work with a very large data set, and value it's speed and clarity. But I am new to it and am having difficulties chaining functions together especially when working with a mixed set of data.table and base R functions. My question is, how do I chain together the below example functions, into one seamless string of code for defining the target data
object?
Below is the correct output, generated by running each line of code separately (unchained) with the generating code shown immediately beneath the output:
> data
ID Period State Values
1: 1 1 X0 5
2: 1 2 X1 0
3: 1 3 X2 0
4: 1 4 X1 0
5: 2 1 X0 1
6: 2 2 XX 0
7: 2 3 XX 0
8: 2 4 XX 0
9: 3 1 X2 0
10: 3 2 X1 0
11: 3 3 X9 0
12: 3 4 X3 0
13: 4 1 X2 1
14: 4 2 X1 2
15: 4 3 X9 3
16: 4 4 XX 0
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
)
# changes State to "XX" if remaining Values_1 Values_2 cumulative sums = 0 for each ID:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 Values_2))), State, "XX"), ID]
# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1]
# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?
# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?
For what it's worth, below is my attempt to chain together using '%>%' pipe operators, which fails (error message Error in data$State : object of type 'closure' is not subsettable), and though I'd rather chain together using data.table operators:
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 Values_2))), State, "XX"), ID] %>%
setDT(data)[,Values := Values_1] %>%
subset(data, select = -c(Values_1,Values_2)) %>%
data$State <- gsub('XX','HI', data$State)
CodePudding user response:
You can just chain using bracket notation [
. That way you only need to call setDT()
once, as you are continuing all operations in the data.table
universe, so data
does not stop being a data.table
. Also setDT()
modifies in place, so it does not need assignment (although by piping to it its return value is being assigned to data
which is fine, too).
First define the data and make it a data.table
:
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0", "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
) |>
setDT()
Then define the columns you need. Note the functional notation to apply a function on several columns.
data[, `:=`(
State = ifelse(
rev(cumsum(rev(Values_1 Values_2))),
State, "XX"
)
),
by = ID
][
,
`:=`(
Values = Values_1,
Values_1 = NULL,
Values_2 = NULL,
State = gsub("XX", "HI", State)
)
]
Output:
data
# ID Period State Values
# 1: 1 1 X0 5
# 2: 1 2 X1 0
# 3: 1 3 X2 0
# 4: 1 4 X1 0
# 5: 2 1 X0 1
# 6: 2 2 HI 0
# 7: 2 3 HI 0
# 8: 2 4 HI 0
# 9: 3 1 X2 0
# 10: 3 2 X1 0
# 11: 3 3 X9 0
# 12: 3 4 X3 0
# 13: 4 1 X2 1
# 14: 4 2 X1 2
# 15: 4 3 X9 3
# 16: 4 4 HI 0
You may want to read further about chaining commands in data.table. I think that page is an excellent summary of the syntax and features of the package and is worth reading in its entirety.
CodePudding user response:
You can use the magrittr
package to chaining data.tables using .
before [
. Try the following code:
library(dplyr)
library(magrittr)
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
setDT(data) %>%
.[, State := ifelse(rev(cumsum(rev(Values_1 Values_2))), State, "XX"), ID] %>%
.[,Values := Values_1] %>%
select(-c(Values_1, Values_2)) %>%
mutate(State = gsub('XX','HI', State))
Output:
rn ID Period State Values
1: 1 1 1 X0 5
2: 2 1 2 X1 0
3: 3 1 3 X2 0
4: 4 1 4 X1 0
5: 5 2 1 X0 1
6: 6 2 2 HI 0
7: 7 2 3 HI 0
8: 8 2 4 HI 0
9: 9 3 1 X2 0
10: 10 3 2 X1 0
11: 11 3 3 X9 0
12: 12 3 4 X3 0
13: 13 4 1 X2 1
14: 14 4 2 X1 2
15: 15 4 3 X9 3
16: 16 4 4 HI 0