I want to add a column to a data.table that is a sequence within groups defined using by, but using a condition on one of the columns used in the by clause. I tried using fifelse as in the following example:
dt <- data.table::data.table(
id = c(1, 1, 2, 2, 2, 3),
clk = c(1, 1, 0, 2, 2, 5),
val = LETTERS[1:6]
)
dt[, seq_clk := fifelse(clk != 0, seq_len(.N), NA_integer_), by = .(id, clk)]
This results in the following error
Error in fifelse(clk != 0, seq_len(.N), NA_integer_) : Length of 'yes' is 2 but must be 1 or length of 'test' (1).
The result I'm expecting to get can be achieved by the following code
dt[, seq_2 := seq_len(.N), by = .(id, clk)][
, seq_clk := fifelse(clk != 0, seq_2, NA_integer_)][
, seq_2 := NULL]
which gives
id clk val seq_clk
1: 1 1 A 1
2: 1 1 B 2
3: 2 0 C NA
4: 2 2 D 1
5: 2 2 E 2
6: 3 5 F 1
Although the above code works, I don't understand why the one-liner in the first example does not work. It seems that the issue is with applying fifelse to a column listed in the by clause. It works fine with columns not in the by clause.
I've also noticed that other functions do not work as I expect in this situation. For example:
dt[, sum_id_by_clk := fifelse(clk != 0, sum(id), NA_integer_), by = .(id, clk)]
does not give an error, but produces incorrect results:
id clk val sum_id_by_clk
1: 1 1 A 1
2: 1 1 B 1
3: 2 0 C NA
4: 2 2 D 2
5: 2 2 E 2
6: 3 5 F 3
I'd expect the values in the last column to be 2 for rows 1-2 and 4 for rows 4-5.
What am I missing here?
CodePudding user response:
If I understand correctly, the OP wants to number the rows within each id, clk group while skipping rows where clk == 0. For this, data.table's rowid() function can be used conveniently:
dt[clk != 0, seq_clk := rowid(id, clk)][]
   id clk val seq_clk
1:  1   1   A       1
2:  1   1   B       2
3:  2   0   C      NA
4:  2   2   D       1
5:  2   2   E       2
6:  3   5   F       1
Row 3 was excluded from the update by reference and gets NA by default.
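Note that NA appears here only because seq_clk is being created; if the column already existed, rows excluded by i would keep their previous values rather than becoming NA. A minimal sketch of that behaviour, reusing the example data:

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 2, 2, 3),
  clk = c(1, 1, 0, 2, 2, 5),
  val = LETTERS[1:6]
)

dt[, seq_clk := 0L]                        # pre-existing column, filled with 0L
dt[clk != 0, seq_clk := rowid(id, clk)][]  # row 3 keeps its old value 0L, not NA
```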
The OP was wondering why calculations on grouping variables do not give the expected result. Here is an attempt at an explanation.
According to ?data.table
, the general form of data.table syntax is:
DT[ i, j, by ] # extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
The way to read this out loud is: "Take DT, subset rows by i, then compute j grouped by by."
So, normally, variables appearing in by are not meant to be used in a j expression. This is further underlined by the explanation of the special symbol .SD (see ?.SD):
.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).
This can be verified by
dt[clk != 0, print(.SD), by = .(id, clk)][]
   val seq_clk
1:   A       1
2:   B       2
   val seq_clk
1:   D       1
2:   E       2
   val seq_clk
1:   F       1
Empty data.table (0 rows and 2 cols): id,clk
The grouping variables can be accessed via the special symbol .BY
, e.g.,
dt[clk != 0, str(.BY), by = .(id, clk)][]
List of 2
 $ id : num 1
 $ clk: num 1
List of 2
 $ id : num 2
 $ clk: num 2
List of 2
 $ id : num 3
 $ clk: num 5
Empty data.table (0 rows and 2 cols): id,clk
So, each grouping variable exists only once per group, which explains why sum(id) simply returns the single id value of the group.
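Since .BY holds the length-1 grouping values, the OP's desired one-liner can also be written with a plain if/else on .BY instead of fifelse. A sketch on the example data (:= recycles the length-1 NA_integer_ to the group size):

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 2, 2, 3),
  clk = c(1, 1, 0, 2, 2, 5),
  val = LETTERS[1:6]
)

# .BY$clk is a scalar per group, so a regular if () works here
dt[, seq_clk := if (.BY$clk != 0) seq_len(.N) else NA_integer_,
   by = .(id, clk)]
```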
CodePudding user response:
We may change the name of the grouping column:
dt[, seq_clk := fifelse(clk != 0, seq_len(.N), NA_integer_),
by = .(id, clk2 = clk)]
-output
> dt
id clk val seq_clk
<num> <num> <char> <int>
1: 1 1 A 1
2: 1 1 B 2
3: 2 0 C NA
4: 2 2 D 1
5: 2 2 E 2
6: 3 5 F 1
Or rep the test part:
dt[, seq_clk := fifelse(rep(clk, .N) != 0, seq_len(.N),
NA_integer_), by = .(id, clk)]
> dt
id clk val seq_clk
<num> <num> <char> <int>
1: 1 1 A 1
2: 1 1 B 2
3: 2 0 C NA
4: 2 2 D 1
5: 2 2 E 2
6: 3 5 F 1
It seems that when the grouping column is used in the condition, only its first element is available, i.e., it has length 1 within each group:
> dt[, length(clk) , clk]
clk V1
<num> <int>
1: 1 1
2: 0 1
3: 2 1
4: 5 1
whereas after renaming the grouping column in by, the full column is available:
> dt[, length(clk) , .(clk2 = clk)]
clk2 V1
<num> <int>
1: 1 2
2: 0 1
3: 2 2
4: 5 1
and fifelse requires 'yes' and 'no' to have length 1 or the length of 'test' (the NA_integer_ of length 1 recycles, but it is better to replicate it as well).
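The same replication trick fixes the OP's sum(id) example too: rep(id, .N) restores the full group length before summing, so each row gets the group total rather than the single grouping value. A sketch on the example data (NA_real_ is used since id is numeric, so sum() returns a double):

```r
library(data.table)

dt <- data.table(
  id  = c(1, 1, 2, 2, 2, 3),
  clk = c(1, 1, 0, 2, 2, 5),
  val = LETTERS[1:6]
)

# rep(id, .N) expands the length-1 grouping value back to the group size
dt[, sum_id_by_clk := fifelse(rep(clk, .N) != 0,
                              rep(sum(rep(id, .N)), .N),
                              NA_real_),
   by = .(id, clk)]
```

This yields 2 for rows 1-2 and 4 for rows 4-5, as the OP expected.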