Let's say I have this example dataframe (but a lot bigger)
df = data.frame(ID_number = c(111,111,111,22,22,33,33),
date = c('2021-06-14','2021-06-12','2021-03-11',
'2021-05-20','2021-05-14',
'2018-04-20','2017-03-14'),
answers = 1:7,
sex = c('F','M','F','M','M','M','F') )
The output
ID_number date answers sex
1 111 2021-06-14 1 F
2 111 2021-06-12 2 M
3 111 2021-03-11 3 F
4 22 2021-05-20 4 M
5 22 2021-05-14 5 M
6 33 2018-04-20 6 M
7 33 2017-03-14 7 F
we can see that there are 7 different members, but the one who created the dataframe has made a mistake and assigned the same ID_number to members 1,2 and 3. The same ID_number to members 4 and 5 and so on ...
In the dataframe there is the data of the collection of the data of each member and I wish to only keep the member that has the earliest date. The resulted dataframe would look like this
ID_number date answers sex
1 111 2021-03-11 3 F
2 22 2021-05-14 5 M
3 33 2017-03-14 7 F
Appreciate the help.
CodePudding user response:
You could filter
on the min
date per group_by
like this:
library(dplyr)
df %>%
group_by(ID_number) %>%
filter(date == min(date))
#> # A tibble: 3 × 4
#> # Groups: ID_number [3]
#> ID_number date answers sex
#> <dbl> <chr> <int> <chr>
#> 1 111 2021-03-11 3 F
#> 2 22 2021-05-14 5 M
#> 3 33 2017-03-14 7 F
Created on 2023-01-04 with reprex v2.0.2
CodePudding user response:
With slice_min
:
library(dplyr)
df %>%
group_by(ID_number) %>%
slice_min(date)
In the dev. version, you can use inline grouping with .by
:
devtools::install_github("tidyverse/dplyr")
df %>%
slice_min(date, .by = ID_number)
CodePudding user response:
Using base R
subset(df, as.numeric(date) == ave(as.numeric(date), ID_number, FUN = min))
ID_number date answers sex
3 111 2021-03-11 3 F
5 22 2021-05-14 5 M
7 33 2017-03-14 7 F