Keep the rows with the latest version. See example

I have a data frame df, it is in this format:

id	version	column_3	column_4
A0001	v02_abc_1	1	`NA`
A0002	v01_abc_1	`NA`	0
A0002	v02_abc_1	0	`NA`
A0002	v03_abc_1	0	1
A0003	v01_abc_1	`NA`	1

How do I transform it into:

id	version	column_3	column_4
A0001	v02_abc_1	1	`NA`
A0002	v03_abc_1	0	1
A0003	v01_abc_1	`NA`	1

Where the row with the latest version for each id is kept and the rest are removed? NOT all ids have three versions; some have two, and some have only one version.

Assume that the data has hundreds of rows.

CodePudding user response：

We may use

library(dplyr)
df1 %>% 
   arrange(id, version) %>%
   group_by(id) %>%
   slice_tail(n = 1) %>%
   ungroup

-output

# A tibble: 3 × 4
  id    version   column_3 column_4
  <chr> <chr>        <int>    <int>
1 A0001 v02_abc_1        1       NA
2 A0002 v03_abc_1        0        1
3 A0003 v01_abc_1       NA        1

data

df1 <- structure(list(id = c("A0001", "A0002", "A0002", "A0002", "A0003"
), version = c("v02_abc_1", "v01_abc_1", "v02_abc_1", "v03_abc_1", 
"v01_abc_1"), column_3 = c(1L, NA, 0L, 0L, NA), column_4 = c(NA, 
0L, NA, 1L, 1L)), class = "data.frame", row.names = c(NA, -5L
))

CodePudding user response：

You could slice the last row per group using slice_max like this:

library(dplyr)

df %>%
  group_by(id) %>%
  slice_max(order_by = version, n = 1)
#> # A tibble: 3 × 4
#> # Groups:   id [3]
#>   id    version   column_3 column_4
#>   <chr> <chr>        <int>    <int>
#> 1 A0001 v02_abc_1        1       NA
#> 2 A0002 v03_abc_1        0        1
#> 3 A0003 v01_abc_1       NA        1

^{Created on 2023-01-06 with reprex v2.0.2}

Data:

df <- read.table(text = "id version column_3    column_4
A0001   v02_abc_1   1   NA
A0002   v01_abc_1   NA  0
A0002   v02_abc_1   0   NA
A0002   v03_abc_1   0   1
A0003   v01_abc_1   NA  1
", header = TRUE)