I am trying to create a new table where table1 is grouped by ID and only the most recent date is picked from table1$new.
I have the tables below:
1st Table:
ID | Status | Date |
---|---|---|
001 | 1003 | 2021-02-01 |
001 | 1003 | 2021-02-01 |
001 | 1003 | 2021-02-21 |
001 | 1003 | 2021-04-21 |
002 | 1003 | 2021-05-01 |
002 | 1003 | 2021-05-01 |
002 | 1003 | 2021-05-21 |
002 | 1003 | 2021-07-21 |
003 | 1004 | 2021-05-01 |
003 | 1004 | 2021-05-01 |
003 | 1004 | 2021-05-21 |
003 | 1004 | 2021-07-21 |
Current Output
ID | Status | Date |
---|---|---|
001 | 1003 | 2021-02-01 |
001 | 1003 | 2021-02-01 |
001 | 1003 | 2021-02-21 |
001 | 1003 | 2021-04-21 |
002 | 1003 | 2021-05-01 |
002 | 1003 | 2021-05-01 |
002 | 1003 | 2021-05-21 |
002 | 1003 | 2021-07-21 |
Desired Output:
ID | New |
---|---|
001 | 2021-4-21 |
002 | 2021-07-21 |
Here is my code as of now
desired <- table1 %>%
group_by(ID) %>%
subset(Status == '1003', select = c("ID", "Date")) %>% rename(New = Date)
I think my first mistake is in the group_by and I am not sure how to only keep the most recent date.
CodePudding user response:
I would use dplyr::slice_max()
to select the most recent date.
library(tidyverse)
d <- tibble::tribble(
~ID, ~Status, ~Date,
"001", 1003L, "2021-02-01",
"001", 1003L, "2021-02-01",
"001", 1003L, "2021-02-21",
"001", 1003L, "2021-04-21",
"002", 1003L, "2021-05-01",
"002", 1003L, "2021-05-01",
"002", 1003L, "2021-05-21",
"002", 1003L, "2021-07-21",
"003", 1004L, "2021-05-01",
"003", 1004L, "2021-05-01",
"003", 1004L, "2021-05-21",
"003", 1004L, "2021-07-21"
)
res <- d |>
mutate(Date = lubridate::ymd(Date)) |>
dplyr::filter(Status == 1003) |>
group_by(ID) |>
slice_max(Date, n = 1) |>
ungroup() |>
select(ID, New = Date) # rename variable in the selection of variables
res
#> # A tibble: 2 x 2
#> ID New
#> <chr> <date>
#> 1 001 2021-04-21
#> 2 002 2021-07-21
Created on 2022-06-15 by the reprex package (v2.0.1)