Home > OS >  How to retain only one record but extract data from several two rows for a unique reference number
How to retain only one record but extract data from several two rows for a unique reference number

Time:12-26

I am trying to remove duplicates but need to retain the first data from 1 row and another data from another row/s in [R] programming.

For example, I have data like this:

ID Transaction # Process # Start Date End Date
ID1 ID1_T1 P1 2022-01-03 2022-01-05
ID1 ID1_T1 P2 2022-01-06 2022-01-10
ID1 ID1_T1 P3 2022-01-11 2022-01-20
ID1 ID1_T1 P4 2022-01-21 2022-01-31
ID2 ID2_T1 P1 2022-01-06 2022-01-11
ID2 ID2_T1 P2 2022-01-12 2022-01-18
ID2 ID2_T1 P3 2022-01-19 2022-01-25
ID3 ID3_T1 P1 2022-01-21 2022-01-22

And I need to simplify it into this:

ID Transaction # Process # Start Date End Date
ID1 ID1_T1 P1 2022-01-03 2022-01-31
ID2 ID2_T1 P1 2022-01-06 2022-01-25
ID3 ID3_T1 P1 2022-01-11 2022-01-22

Retaining only the first Start Date and last End Date of the transaction per unique ID and Transaction #.

I'm expecting help on how to code this in R using the tidyverse library since I'm not yet familiar with Data.table.

Thanks very much.

CodePudding user response:

You can use group_by summarise:

library(dplyr)

df %>%
  group_by(ID, Transaction) %>%
  summarise(Process = first(Process),
            Start_Date = first(Start_Date), # or min(Start_Date)
            End_Date = last(End_Date),      # or max(End_Date)
            .groups = "drop")
  • Related