I want to create a new identifier column in a dataset I have.
ex <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002"), serv_from_dt = structure(c(18262,
18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"),
serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268,
18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263,
18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date")),
row.names = c(NA, -9L), class = c("data.table", "data.frame"))
This identifier would be based on the serv_to_date, serv_from_date, and date_plus1 columns. The data is ordered by serv_from_date; if the next row has a ser_to_date that is equal to the previous row's serv_from_date OR a serv_to_date that is equal to the previous row's serv_from_date 1 (which is the date_plus1 column) then label those rows with 1 identifier.
The final output I'd like is:
want <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002", "8210109300002", "8210109300002",
"8210109300002", "8210109300002"), serv_from_dt = structure(c(18262,
18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"),
serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268,
18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263,
18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date"),
identifier = c("1", "1", "2",
"2", "2", "2", "2",
"3", "3")), row.names = c(NA, -9L), class = c("data.table", "data.frame"))
My first step was to create a column that identifies lag dates with the previous row's dates:
ex %>%
mutate(NewCol = ifelse((lag(serv_from_dt) == date_plus1 | lag(serv_from_dt) == serv_to_dt), "yes", "no"))
however, this code does not correctly say "yes" to serv_from_date's that match previous row's date_plus1.
Thanks in advance for any help you can provide!
CodePudding user response:
Your logic is good, you're just missing the final step: we need to do a cumulative count of the "yes" values, with cumsum
.
And actually we can simplify if we skip the ifelse
and leave the result as TRUE/FALSE instead of "yes"/"no", and use a nice default value to make sure the first row is TRUE.
want %>%
mutate(NewCol = cumsum(
lag(serv_from_dt, default = first(date_plus1)) == date_plus1 |
lag(serv_from_dt) == serv_to_dt)
)
# id serv_from_dt serv_to_dt date_plus1 identifier NewCol
# 1 8210109300002 2020-01-01 2020-01-01 2020-01-02 1 1
# 2 8210109300002 2020-01-02 2020-01-02 2020-01-03 1 1
# 3 8210109300002 2020-01-06 2020-01-06 2020-01-07 2 1
# 4 8210109300002 2020-01-06 2020-01-06 2020-01-07 2 2
# 5 8210109300002 2020-01-07 2020-01-07 2020-01-08 2 2
# 6 8210109300002 2020-01-08 2020-01-08 2020-01-09 2 2
# 7 8210109300002 2020-01-08 2020-01-08 2020-01-09 2 3
# 8 8210109300002 2020-01-14 2020-01-14 2020-01-15 3 3
# 9 8210109300002 2020-01-15 2020-01-15 2020-01-16 3 3
CodePudding user response:
With data.table
:
library(data.table)
setDT(ex)
ex[,identifier:=cumsum(!(serv_to_dt == shift(serv_from_dt,1,fill = FALSE)|serv_to_dt == shift(serv_from_dt,1,fill=FALSE) 1))][]
id serv_from_dt serv_to_dt date_plus1 identifier
1: 8210109300002 2020-01-01 2020-01-01 2020-01-02 1
2: 8210109300002 2020-01-02 2020-01-02 2020-01-03 1
3: 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
4: 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
5: 8210109300002 2020-01-07 2020-01-07 2020-01-08 2
6: 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
7: 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
8: 8210109300002 2020-01-14 2020-01-14 2020-01-15 3
9: 8210109300002 2020-01-15 2020-01-15 2020-01-16 3
CodePudding user response:
The following logic using cumsum
will only increment when serv_to_dt
is not equal to either the lag value of serv_from_dt
and date_plus1
. row_number() == 1
starts the cumulative sum at 1.
library(dplyr)
ex %>%
mutate(identifier = cumsum((serv_to_dt != lag(serv_from_dt) & serv_to_dt != lag(date_plus1)) | row_number() == 1))
Output
id serv_from_dt serv_to_dt date_plus1 identifier
1 8210109300002 2020-01-01 2020-01-01 2020-01-02 1
2 8210109300002 2020-01-02 2020-01-02 2020-01-03 1
3 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
4 8210109300002 2020-01-06 2020-01-06 2020-01-07 2
5 8210109300002 2020-01-07 2020-01-07 2020-01-08 2
6 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
7 8210109300002 2020-01-08 2020-01-08 2020-01-09 2
8 8210109300002 2020-01-14 2020-01-14 2020-01-15 3
9 8210109300002 2020-01-15 2020-01-15 2020-01-16 3