Home > Net >  Flagging the first row by group in data.table
Flagging the first row by group in data.table

Time:01-31

In the data.table below, I want to flag the first row by each group.

  temp_dt <- data.table(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30",
                                     "2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30",
                                     "2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                    group = c(1,1,1,1,1,1,1,1,
                              2,2,6,6,6,8,8,8,
                              3,3,3,3,4,4,4,4))

Following is the expected outcome after adding the flag.

> temp_dt
          date group flag
 1: 2000-01-01     1    1
 2: 2000-03-31     1    0
 3: 2000-07-01     1    0
 4: 2000-09-30     1    0
 5: 2001-01-01     1    0
 6: 2001-03-31     1    0
 7: 2001-07-01     1    0
 8: 2001-09-30     1    0
 9: 2000-01-01     2    1
10: 2000-03-31     2    0
11: 2000-07-01     6    1
12: 2000-09-30     6    0
13: 2001-01-01     6    0
14: 2001-03-31     8    1
15: 2001-07-01     8    0
16: 2001-09-30     8    0
17: 2000-01-01     3    1
18: 2000-03-31     3    0
19: 2000-07-01     3    0
20: 2000-09-30     3    0
21: 2001-01-01     4    1
22: 2001-03-31     4    0
23: 2001-07-01     4    0
24: 2001-09-30     4    0
          date group flag

Here is the solution I tried (it is fast) but it is not working as expected.

temp_dt[, flag := if(identical(.I, 1)) 1 else 0, by = .(group)]

Since I am dealing with millions of rows, performance is key in the solution. I am looking for only data.table solution.

Other solutions provided on SO are too slow for my requirement.

CodePudding user response:

You can use rowid function. It will give incremental index starting at 1 for each group defined by given grouping variable(s). Detecting first line is simply done by comparing to 1.

temp_dt[, flag := rowid(group)==1]

CodePudding user response:

This way finds the earliest row by date per group and then sets flag == 1 for those rows.

temp_dt[temp_dt[, .I[date == min(date)], by = .(group)]$V1, flag := 1]

# set the rest of the column to 0
temp_dt[is.na(flag), flag := 0]
  • Related