I am trying to remove duplicates but need to retain the first data from 1 row and another data from another row/s in [R] programming.
For example, I have data like this:
ID | Transaction # | Process # | Start Date | End Date |
---|---|---|---|---|
ID1 | ID1_T1 | P1 | 2022-01-03 | 2022-01-05 |
ID1 | ID1_T1 | P2 | 2022-01-06 | 2022-01-10 |
ID1 | ID1_T1 | P3 | 2022-01-11 | 2022-01-20 |
ID1 | ID1_T1 | P4 | 2022-01-21 | 2022-01-31 |
ID2 | ID2_T1 | P1 | 2022-01-06 | 2022-01-11 |
ID2 | ID2_T1 | P2 | 2022-01-12 | 2022-01-18 |
ID2 | ID2_T1 | P3 | 2022-01-19 | 2022-01-25 |
ID3 | ID3_T1 | P1 | 2022-01-21 | 2022-01-22 |
And I need to simplify it into this:
ID | Transaction # | Process # | Start Date | End Date |
---|---|---|---|---|
ID1 | ID1_T1 | P1 | 2022-01-03 | 2022-01-31 |
ID2 | ID2_T1 | P1 | 2022-01-06 | 2022-01-25 |
ID3 | ID3_T1 | P1 | 2022-01-11 | 2022-01-22 |
Retaining only the first Start Date and last End Date of the transaction per unique ID and Transaction #.
I'm expecting help on how to code this in R using the tidyverse library since I'm not yet familiar with Data.table.
Thanks very much.
CodePudding user response:
You can use group_by
summarise
:
library(dplyr)
df %>%
group_by(ID, Transaction) %>%
summarise(Process = first(Process),
Start_Date = first(Start_Date), # or min(Start_Date)
End_Date = last(End_Date), # or max(End_Date)
.groups = "drop")