There is an algorithm that identifies issues/incidents in a network. After that it writes all the cases into a DB. Lets say it looks like (simplified):
ID | Date | Case |
---|---|---|
A1 | 2022-01-01 | 1 |
B1 | 2022-01-01 | 2 |
C1 | 2022-01-01 | 3 |
A1 | 2022-01-02 | NA |
C1 | 2022-01-02 | NA |
A1 | 2022-01-03 | NA |
B1 | 2022-01-03 | NA |
C1 | 2022-01-03 | NA |
Each row represents an incident.
Now I want to identify if an incident already existed the last time we ran this script. For this it should check the actual Date and compare it with the last existing date in the table.
Note: it can occur that the last day was not yesterday, it might be that there is a difference up to 7 days.
So the logical is:
- compare the second highest
date
value in thisID
group with the second highestDate
from fulldf
- If it's the same it means, case already exists. Then take it's last
Case
number. If it's new, create a newCase
number (max(Case) 1
)
Update 11.05.2022 - 17:02:
- It should consider existing
Case
values and not overwrite them. Or with other words, it should overwrite/fill theNA
s. There will never beNA
s in between. The existing cases always have a number, the new ones doesn't. 100%.
Expected result:
ID | Date | Case | Comment |
---|---|---|---|
A1 | 2022-01-01 | 1 | |
B1 | 2022-01-01 | 2 | |
C1 | 2022-01-01 | 3 | |
A1 | 2022-01-02 | 1 | |
C1 | 2022-01-02 | 3 | |
A1 | 2022-01-03 | 1 | |
B1 | 2022-01-03 | 4 | New case, as there wasn't B1 on 2022-01-02 |
C1 | 2022-01-03 | 3 |
I was able to identify the second highest Dates:
> df[, nth(unique(Date),length(unique(Date))-1), ID]
ID V1
1: A1 2022-01-02 ## TRUE, as it's the second highest Date
2: B1 2022-01-01 ## FALSE, as it's not the second highest Date
3: C1 2022-01-02 ## TRUE, as it's the second highest Date
> df[, nth(unique(Date),length(unique(Date))-1)]
[1] "2022-01-02" ## Second highest Date in df
But now I'm struggling with creating a new column with this condition. Can somebody please help? data.table
solution prefered, but dplyr
is great as well.
MWE
library(data.table)
df = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
Case = NA)
Goal = data.table(ID=c("A1", "B1", "C1", "A1", "C1", "A1", "B1", "C1"),
Date=as.Date(c("2022-01-01","2022-01-01","2022-01-01","2022-01-02","2022-01-02","2022-01-03", "2022-01-03", "2022-01-03")),
Case=c(1,2,3,1,3,1,4,3))
CodePudding user response:
How about this:
df[order(Date), d:=c(1,diff(Date)), by = ID][
order(d,ID),case:=rleid(ID,d)][
,d:=NULL]
Output:
ID Date case
1: A1 2022-01-01 1
2: B1 2022-01-01 2
3: C1 2022-01-01 3
4: A1 2022-01-02 1
5: C1 2022-01-02 3
6: A1 2022-01-03 1
7: B1 2022-01-03 4
8: C1 2022-01-03 3
If you actually want the comment column, you can refine the above, like this:
df[order(Date), d:=c(1,diff(Date)), by = ID][
order(d,ID),`:=`(
case=rleid(ID,d),
comment=fifelse(d!=1,paste0("New case, as there was no ", ID, " on ",Date-1),""))][
,d:=NULL][]
Output:
ID Date case comment
1: A1 2022-01-01 1
2: B1 2022-01-01 2
3: C1 2022-01-01 3
4: A1 2022-01-02 1
5: C1 2022-01-02 3
6: A1 2022-01-03 1
7: B1 2022-01-03 4 New case, as there was no B1 on 2022-01-02
8: C1 2022-01-03 3