Home > Software engineering >  How to delete entire rows from a dataframe based on the date the data was collected?
How to delete entire rows from a dataframe based on the date the data was collected?

Time:01-05

Let's say I have this example dataframe (but a lot bigger)

df = data.frame(ID_number = c(111,111,111,22,22,33,33),
                date = c('2021-06-14','2021-06-12','2021-03-11',
                         '2021-05-20','2021-05-14',
                         '2018-04-20','2017-03-14'),
                answers = 1:7,
                sex = c('F','M','F','M','M','M','F') )

The output

  ID_number       date  answers sex
1       111 2021-06-14       1   F
2       111 2021-06-12       2   M
3       111 2021-03-11       3   F
4        22 2021-05-20       4   M
5        22 2021-05-14       5   M
6        33 2018-04-20       6   M
7        33 2017-03-14       7   F

we can see that there are 7 different members, but the one who created the dataframe has made a mistake and assigned the same ID_number to members 1,2 and 3. The same ID_number to members 4 and 5 and so on ...

In the dataframe there is the data of the collection of the data of each member and I wish to only keep the member that has the earliest date. The resulted dataframe would look like this

  ID_number       date   answers sex
1       111 2021-03-11       3   F
2        22 2021-05-14       5   M
3        33 2017-03-14       7   F

Appreciate the help.

CodePudding user response:

You could filter on the min date per group_by like this:

library(dplyr)
df %>%
  group_by(ID_number) %>%
  filter(date == min(date))
#> # A tibble: 3 × 4
#> # Groups:   ID_number [3]
#>   ID_number date       answers sex  
#>       <dbl> <chr>        <int> <chr>
#> 1       111 2021-03-11       3 F    
#> 2        22 2021-05-14       5 M    
#> 3        33 2017-03-14       7 F

Created on 2023-01-04 with reprex v2.0.2

CodePudding user response:

With slice_min:

library(dplyr)
df %>% 
  group_by(ID_number) %>% 
  slice_min(date)

In the dev. version, you can use inline grouping with .by:

devtools::install_github("tidyverse/dplyr")
df %>% 
  slice_min(date, .by = ID_number)

CodePudding user response:

Using base R

subset(df, as.numeric(date) == ave(as.numeric(date), ID_number, FUN = min))
  ID_number       date answers sex
3       111 2021-03-11       3   F
5        22 2021-05-14       5   M
7        33 2017-03-14       7   F
  • Related