I have a data frame df
, it is in this format:
id | version | column_3 | column_4 |
---|---|---|---|
A0001 | v02_abc_1 | 1 | NA |
A0002 | v01_abc_1 | NA |
0 |
A0002 | v02_abc_1 | 0 | NA |
A0002 | v03_abc_1 | 0 | 1 |
A0003 | v01_abc_1 | NA |
1 |
How do I transform it into:
id | version | column_3 | column_4 |
---|---|---|---|
A0001 | v02_abc_1 | 1 | NA |
A0002 | v03_abc_1 | 0 | 1 |
A0003 | v01_abc_1 | NA |
1 |
Where the row with the latest version for each id
is kept and the rest are removed? NOT all id
s have three versions; some have two, and some have only one version.
Assume that the data has hundreds of rows.
CodePudding user response:
We may use
library(dplyr)
df1 %>%
arrange(id, version) %>%
group_by(id) %>%
slice_tail(n = 1) %>%
ungroup
-output
# A tibble: 3 × 4
id version column_3 column_4
<chr> <chr> <int> <int>
1 A0001 v02_abc_1 1 NA
2 A0002 v03_abc_1 0 1
3 A0003 v01_abc_1 NA 1
data
df1 <- structure(list(id = c("A0001", "A0002", "A0002", "A0002", "A0003"
), version = c("v02_abc_1", "v01_abc_1", "v02_abc_1", "v03_abc_1",
"v01_abc_1"), column_3 = c(1L, NA, 0L, 0L, NA), column_4 = c(NA,
0L, NA, 1L, 1L)), class = "data.frame", row.names = c(NA, -5L
))
CodePudding user response:
You could slice the last row per group using slice_max
like this:
library(dplyr)
df %>%
group_by(id) %>%
slice_max(order_by = version, n = 1)
#> # A tibble: 3 × 4
#> # Groups: id [3]
#> id version column_3 column_4
#> <chr> <chr> <int> <int>
#> 1 A0001 v02_abc_1 1 NA
#> 2 A0002 v03_abc_1 0 1
#> 3 A0003 v01_abc_1 NA 1
Created on 2023-01-06 with reprex v2.0.2
Data:
df <- read.table(text = "id version column_3 column_4
A0001 v02_abc_1 1 NA
A0002 v01_abc_1 NA 0
A0002 v02_abc_1 0 NA
A0002 v03_abc_1 0 1
A0003 v01_abc_1 NA 1
", header = TRUE)