Flagging the first row by group in data.table-CodePudding

In the data.table below, I want to flag the first row by each group.

  temp_dt <- data.table(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30",
                                     "2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30",
                                     "2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                    group = c(1,1,1,1,1,1,1,1,
                              2,2,6,6,6,8,8,8,
                              3,3,3,3,4,4,4,4))

Following is the expected outcome after adding the flag.

> temp_dt
          date group flag
 1: 2000-01-01     1    1
 2: 2000-03-31     1    0
 3: 2000-07-01     1    0
 4: 2000-09-30     1    0
 5: 2001-01-01     1    0
 6: 2001-03-31     1    0
 7: 2001-07-01     1    0
 8: 2001-09-30     1    0
 9: 2000-01-01     2    1
10: 2000-03-31     2    0
11: 2000-07-01     6    1
12: 2000-09-30     6    0
13: 2001-01-01     6    0
14: 2001-03-31     8    1
15: 2001-07-01     8    0
16: 2001-09-30     8    0
17: 2000-01-01     3    1
18: 2000-03-31     3    0
19: 2000-07-01     3    0
20: 2000-09-30     3    0
21: 2001-01-01     4    1
22: 2001-03-31     4    0
23: 2001-07-01     4    0
24: 2001-09-30     4    0
          date group flag

Here is the solution I tried (it is fast) but it is not working as expected.

temp_dt[, flag := if(identical(.I, 1)) 1 else 0, by = .(group)]

Since I am dealing with millions of rows, performance is key in the solution. I am looking for only data.table solution.

Other solutions provided on SO are too slow for my requirement.

CodePudding user response：

You can use rowid function. It will give incremental index starting at 1 for each group defined by given grouping variable(s). Detecting first line is simply done by comparing to 1.

temp_dt[, flag := rowid(group)==1]

CodePudding user response：

This way finds the earliest row by date per group and then sets flag == 1 for those rows.

temp_dt[temp_dt[, .I[date == min(date)], by = .(group)]$V1, flag := 1]

# set the rest of the column to 0
temp_dt[is.na(flag), flag := 0]