I met a problem when I am dealing with value change by using lead()
.
df <- data.frame(id = c(11,11,11,11,12,12,12,12,13,13,13,13),
drug_type = c("A","A","B","C","B","B","B","B","A","A","B","B"),
drug_brand = c(1,1,2,2,2,3,3,3,1,1,2,3),
date = c("2020-01-01","2020-02-01","2020-03-01","2020-03-13",
"2019-04-05","2019-05-02","2019-06-03","2019-08-04",
"2021-02-02","2021-02-27","2021-03-22","2021-04-11"))
As for the data frame shown above, I aim to identify the change in drug type for each patient, if there is no change in the drug type for a patient, then they will be ignored. To achieve it I used
df %>%
group_by(id) %>%
filter(lead(drug_type)=="B")
However, it only returned the corresponding rows whose next value is "B", while I want to also keep the rows containing changed values. For example, in terms of patient no.11, the result only returned me the record on the "2020-02-01", but I hope to find a way to keep the record on the "2020-03-01" as well, that is, the changed value.
I tried some ways, including creating extra columns named "next" showing the next values of each value in the drug type, but they are either wrong or excessively redundant. Therefore, I want to know if there is any better way to achieve my goal.
CodePudding user response:
You can check if drug type differs from lead()
or if drug type differs from lag()
df %>%
group_by(id) %>%
filter(
(drug_type!=lag(drug_type)) |
drug_type!=lead(drug_type)
)
Output:
id drug_type drug_brand date
<dbl> <chr> <dbl> <chr>
1 11 A 1 2020-02-01
2 11 B 2 2020-03-01
3 11 C 2 2020-03-13
4 13 A 1 2021-02-27
5 13 B 2 2021-03-22
If you only want rows that change to "B" plus the row that has that new "B" value, you could do something like this:
df %>% group_by(id) %>%
filter((drug_type!="B" & lead(drug_type)=="B") | drug_type=="B" & lag(drug_type)!="B")
Output:
id drug_type drug_brand date
<dbl> <chr> <dbl> <chr>
1 11 A 1 2020-02-01
2 11 B 2 2020-03-01
3 13 A 1 2021-02-27
4 13 B 2 2021-03-22