I want to add a column to a data.table that is a sequence within groups defined using by, but using a condition on one of the columns used in the by clause. I tried using fifelse as in the following example:
dt <- data.table::data.table(
id = c(1, 1, 2, 2, 2, 3),
clk = c(1, 1, 0, 2, 2, 5),
val = LETTERS[1:6]
)
dt[, seq_clk := fifelse(clk != 0, seq_len(.N), NA_integer_), by = .(id, clk)]
This results in the following error
Error in fifelse(clk != 0, seq_len(.N), NA_integer_) : Length of 'yes' is 2 but must be 1 or length of 'test' (1).
The result I'm expecting to get can be achieved by the following code
dt[, seq_2 := seq_len(.N), by = .(id, clk)][
, seq_clk := fifelse(clk != 0, seq_2, NA_integer_)][
, seq_2 := NULL]
which gives
id clk val seq_clk
1: 1 1 A 1
2: 1 1 B 2
3: 2 0 C NA
4: 2 2 D 1
5: 2 2 E 2
6: 3 5 F 1
Although the above code works, I don't understand why the one-liner in the first example does not work. It seems that the issue is with applying fifelse to a column listed in the by clause. It works fine with columns not in the by clause.
I've also noticed that other functions do not work as I expect in this situation. For example:
dt[, sum_id_by_clk := fifelse(clk != 0, sum(id), NA_integer_), by = .(id, clk)]
does not give an error, but produces incorrect results:
id clk val sum_id_by_clk
1: 1 1 A 1
2: 1 1 B 1
3: 2 0 C NA
4: 2 2 D 2
5: 2 2 E 2
6: 3 5 F 3
I'd expect the values in the last column to be 2 for rows 1-2 and 4 for rows 4-5.
What am I missing here?
CodePudding user response:
If I understand correctly, the OP wants to number the rows within each id, clk group while skipping rows where clk == 0. For this, data.table's rowid() function can be used conveniently:
dt[clk != 0, seq_clk := rowid(id, clk)][]
   id clk val seq_clk
1:  1   1   A       1
2:  1   1   B       2
3:  2   0   C      NA
4:  2   2   D       1
5:  2   2   E       2
6:  3   5   F       1
Row 3 was excluded from the update by reference and gets NA by default.
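Note that NA appears here only because seq_clk is being created; if the column already existed, rows excluded by i would keep their previous values rather than becoming NA. A minimal sketch of that behaviour, reusing the example data:

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 2, 2, 3),
  clk = c(1, 1, 0, 2, 2, 5),
  val = LETTERS[1:6]
)

dt[, seq_clk := 0L]                        # pre-existing column, filled with 0L
dt[clk != 0, seq_clk := rowid(id, clk)][]  # row 3 keeps its old value 0L, not NA
```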
The OP was wondering why calculations on grouping variables do not give the expected result. Here is an attempt at an explanation.
According to ?data.table
, the general form of data.table syntax is:
DT[ i, j, by ] # extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
The way to read this out loud is: "Take DT, subset rows by i, then compute j grouped by by."
So, normally, variables appearing in by are not meant to be used in a j expression. This is further underlined by the explanation of the special symbol .SD (see ?.SD):
.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).
This can be verified by
dt[clk != 0, print(.SD), by = .(id, clk)][]
   val seq_clk
1:   A       1
2:   B       2
   val seq_clk
1:   D       1
2:   E       2
   val seq_clk
1:   F       1
Empty data.table (0 rows and 2 cols): id,clk
The grouping variables can be accessed via the special symbol .BY
, e.g.,
dt[clk != 0, str(.BY), by = .(id, clk)][]
List of 2
 $ id : num 1
 $ clk: num 1
List of 2
 $ id : num 2
 $ clk: num 2
List of 2
 $ id : num 3
 $ clk: num 5
Empty data.table (0 rows and 2 cols): id,clk
So, each grouping variable exists only once per group, which explains why sum(id) simply returns the single id value of the group.
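Since .BY holds the length-1 grouping values, the OP's desired one-liner can also be written with a plain if/else on .BY instead of fifelse. A sketch on the example data (:= recycles the length-1 NA_integer_ to the group size):

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 2, 2, 3),
  clk = c(1, 1, 0, 2, 2, 5),
  val = LETTERS[1:6]
)

# .BY$clk is a scalar per group, so a regular if () works here
dt[, seq_clk := if (.BY$clk != 0) seq_len(.N) else NA_integer_,
   by = .(id, clk)]
```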
CodePudding user response:
We may change the name of the grouping column:
dt[, seq_clk := fifelse(clk != 0, seq_len(.N), NA_integer_),
by = .(id, clk2 = clk)]
-output
> dt
id clk val seq_clk
<num> <num> <char> <int>
1: 1 1 A 1
2: 1 1 B 2
3: 2 0 C NA
4: 2 2 D 1
5: 2 2 E 2
6: 3 5 F 1
Or rep the test part:
dt[, seq_clk := fifelse(rep(clk, .N) != 0, seq_len(.N),
NA_integer_), by = .(id, clk)]
> dt
id clk val seq_clk
<num> <num> <char> <int>
1: 1 1 A 1
2: 1 1 B 2
3: 2 0 C NA
4: 2 2 D 1
5: 2 2 E 2
6: 3 5 F 1
It seems that when the grouping column is used in the condition, only its first element is available, i.e., it has length 1 within each group:
> dt[, length(clk) , clk]
clk V1
<num> <int>
1: 1 1
2: 0 1
3: 2 1
4: 5 1
whereas after renaming the grouping column in by, the full column is available:
> dt[, length(clk) , .(clk2 = clk)]
clk2 V1
<num> <int>
1: 1 2
2: 0 1
3: 2 2
4: 5 1
and fifelse requires 'yes' and 'no' to have length 1 or the length of 'test' (the NA_integer_ of length 1 recycles, but it is better to replicate it as well).
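The same replication trick fixes the OP's sum(id) example too: rep(id, .N) restores the full group length before summing, so each row gets the group total rather than the single grouping value. A sketch on the example data (NA_real_ is used since id is numeric, so sum() returns a double):

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 2, 2, 3),
  clk = c(1, 1, 0, 2, 2, 5),
  val = LETTERS[1:6]
)

# rep(id, .N) expands the length-1 grouping value back to the group size
dt[, sum_id_by_clk := fifelse(rep(clk, .N) != 0,
                              rep(sum(rep(id, .N)), .N),
                              NA_real_),
   by = .(id, clk)]
```

This yields 2 for rows 1-2 and 4 for rows 4-5, as the OP expected.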