In the data.table below, I want to flag the first row by each group.
temp_dt <- data.table(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30",
"2001-01-01","2001-03-31","2001-07-01","2001-09-30",
"2000-01-01","2000-03-31","2000-07-01","2000-09-30",
"2001-01-01","2001-03-31","2001-07-01","2001-09-30",
"2000-01-01","2000-03-31","2000-07-01","2000-09-30",
"2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
group = c(1,1,1,1,1,1,1,1,
2,2,6,6,6,8,8,8,
3,3,3,3,4,4,4,4))
Following is the expected outcome after adding the flag.
> temp_dt
date group flag
1: 2000-01-01 1 1
2: 2000-03-31 1 0
3: 2000-07-01 1 0
4: 2000-09-30 1 0
5: 2001-01-01 1 0
6: 2001-03-31 1 0
7: 2001-07-01 1 0
8: 2001-09-30 1 0
9: 2000-01-01 2 1
10: 2000-03-31 2 0
11: 2000-07-01 6 1
12: 2000-09-30 6 0
13: 2001-01-01 6 0
14: 2001-03-31 8 1
15: 2001-07-01 8 0
16: 2001-09-30 8 0
17: 2000-01-01 3 1
18: 2000-03-31 3 0
19: 2000-07-01 3 0
20: 2000-09-30 3 0
21: 2001-01-01 4 1
22: 2001-03-31 4 0
23: 2001-07-01 4 0
24: 2001-09-30 4 0
date group flag
Here is the solution I tried (it is fast) but it is not working as expected.
temp_dt[, flag := if(identical(.I, 1)) 1 else 0, by = .(group)]
Since I am dealing with millions of rows, performance is key in the solution. I am looking for only data.table solution.
Other solutions provided on SO are too slow for my requirement.
CodePudding user response:
You can use rowid
function. It will give incremental index starting at 1
for each group defined by given grouping variable(s). Detecting first line is simply done by comparing to 1
.
temp_dt[, flag := rowid(group)==1]
CodePudding user response:
This way finds the earliest row by date per group and then sets flag == 1
for those rows.
temp_dt[temp_dt[, .I[date == min(date)], by = .(group)]$V1, flag := 1]
# set the rest of the column to 0
temp_dt[is.na(flag), flag := 0]