Compare whether incident is new or already exists-CodePudding

There is an algorithm that identifies issues/incidents in a network. After that it writes all the cases into a DB. Lets say it looks like (simplified):

ID	Date	Case
A1	2022-01-01	1
B1	2022-01-01	2
C1	2022-01-01	3
A1	2022-01-02	NA
C1	2022-01-02	NA
A1	2022-01-03	NA
B1	2022-01-03	NA
C1	2022-01-03	NA

Each row represents an incident.
Now I want to identify if an incident already existed the last time we ran this script. For this it should check the actual Date and compare it with the last existing date in the table.
Note: it can occur that the last day was not yesterday, it might be that there is a difference up to 7 days.

So the logical is:

compare the second highest date value in this ID group with the second highest Date from full df
If it's the same it means, case already exists. Then take it's last Case number. If it's new, create a new Case number (max(Case) 1)

Update 11.05.2022 - 17:02:

It should consider existing Case values and not overwrite them. Or with other words, it should overwrite/fill the NAs. There will never be NAs in between. The existing cases always have a number, the new ones doesn't. 100%.

Expected result:

ID	Date	Case	Comment
A1	2022-01-01	1
B1	2022-01-01	2
C1	2022-01-01	3
A1	2022-01-02	1
C1	2022-01-02	3
A1	2022-01-03	1
B1	2022-01-03	4	New case, as there wasn't B1 on 2022-01-02
C1	2022-01-03	3

I was able to identify the second highest Dates:

> df[, nth(unique(Date),length(unique(Date))-1), ID]
   ID         V1
1: A1 2022-01-02 ## TRUE, as it's the second highest Date
2: B1 2022-01-01 ## FALSE, as it's not the second highest Date
3: C1 2022-01-02 ## TRUE, as it's the second highest Date
> df[, nth(unique(Date),length(unique(Date))-1)]
[1] "2022-01-02" ## Second highest Date in df

But now I'm struggling with creating a new column with this condition. Can somebody please help? data.table solution prefered, but dplyr is great as well.

MWE

library(data.table)

df = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
            Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
            Case = NA)


Goal = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
                Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
                Case=c(1,2,3,1,3,1,4,3))

CodePudding user response：

How about this:

df[order(Date), d:=c(1,diff(Date)), by = ID][
  order(d,ID),case:=rleid(ID,d)][
    ,d:=NULL]

Output:

   ID       Date case
1: A1 2022-01-01    1
2: B1 2022-01-01    2
3: C1 2022-01-01    3
4: A1 2022-01-02    1
5: C1 2022-01-02    3
6: A1 2022-01-03    1
7: B1 2022-01-03    4
8: C1 2022-01-03    3

If you actually want the comment column, you can refine the above, like this:

df[order(Date), d:=c(1,diff(Date)), by = ID][
  order(d,ID),`:=`(
    case=rleid(ID,d),
    comment=fifelse(d!=1,paste0("New case, as there was no ", ID, " on ",Date-1),""))][
      ,d:=NULL][]

Output:

   ID       Date case                                    comment
1: A1 2022-01-01    1                                           
2: B1 2022-01-01    2                                           
3: C1 2022-01-01    3                                           
4: A1 2022-01-02    1                                           
5: C1 2022-01-02    3                                           
6: A1 2022-01-03    1                                           
7: B1 2022-01-03    4 New case, as there was no B1 on 2022-01-02
8: C1 2022-01-03    3