Filter first date per year with several columns

I have been looking for a while without finding an answer, so I am trying here:

I have a column of dates where the first observation of an animal is listed: 2022-05-03, 2022-05-01, 2022-04-23, 2021-05-04, 2021-02-31, 2020-01-30, 2020-05-20, and so on.

I want to find the first observation per year using the filter() function. What is that supposed to look like? Is the lubridate package something to apply here?

Thanks in advance.

CodePudding user response：

You can try:

library(dplyr)
library(lubridate)
df = tibble(date = as.Date(c("2022-05-03", "2022-05-01", "2022-04-23", "2021-05-04", "2021-02-28", "2020-01-30", "2020-05-20")))

Then, to get the first date by year:

df %>% mutate(year = year(date)) %>% arrange(date) %>% group_by(year) %>% slice(1)
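
Since the question asks about filter() specifically, here is a minimal sketch of the same idea written with filter() instead of arrange() and slice(). It assumes dplyr and lubridate are installed; the name first_per_year is only illustrative.

```r
library(dplyr)
library(lubridate)

# Same example dates as above
df <- tibble(date = as.Date(c("2022-05-03", "2022-05-01", "2022-04-23",
                              "2021-05-04", "2021-02-28",
                              "2020-01-30", "2020-05-20")))

first_per_year <- df %>%
  group_by(year = year(date)) %>%   # one group per calendar year
  filter(date == min(date)) %>%     # keep rows equal to the group's earliest date
  ungroup()

first_per_year
```

One difference to keep in mind: if the earliest date occurs more than once within a year, filter() keeps all of those rows, whereas slice(1) keeps exactly one.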

Best wishes!

CodePudding user response：

I'll show you some ways. First of all, use the "Date" format for dates!

animal_data <- transform(animal_data, date=as.Date(date))

Here is an option using aggregate with the formula interface, aggregating by animal name and by characters 1 to 4 of the date, i.e. the year:

aggregate(date ~ animals + substr(date, 1, 4), animal_data, min)
#   animals substr(date, 1, 4)       date
# 1 Gorilla               2020 2020-07-05
# 2  Rhebok               2020 2020-02-22
# 3  Vicuna               2020 2020-06-23
# 4 Gorilla               2021 2021-01-11
# 5  Rhebok               2021 2021-03-10
# 6  Vicuna               2021 2021-05-24
# 7 Gorilla               2022 2022-05-03
# 8  Rhebok               2022 2022-04-29

Or with list notation, which is the most flexible regarding the column names of the result:

with(animal_data, aggregate(list(date=date), list(animals=animals, year=substr(date, 1, 4)), min))
#   animals year       date
# 1 Gorilla 2020 2020-07-05
# 2  Rhebok 2020 2020-02-22
# 3  Vicuna 2020 2020-06-23
# 4 Gorilla 2021 2021-01-11
# 5  Rhebok 2021 2021-03-10
# 6  Vicuna 2021 2021-05-24
# 7 Gorilla 2022 2022-05-03
# 8  Rhebok 2022 2022-04-29
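
As a small variation on the aggregate calls above, the year can also be extracted with format(date, "%Y") and stored in its own column, which gives a clean column name with the formula interface too. This is only a sketch on hypothetical toy data (toy, first_obs and the animal labels "A" and "B" are made up for illustration, they are not the animal_data used here).

```r
# Hypothetical toy data, not the animal_data from this answer
toy <- data.frame(
  animals = c("A", "A", "B", "B", "A"),
  date = as.Date(c("2020-03-01", "2020-01-15", "2020-02-10",
                   "2021-06-01", "2021-04-20"))
)

# Year as a character column; equivalent to substr(date, 1, 4) for Date values
toy$year <- format(toy$date, "%Y")

# Earliest date per animal and year
first_obs <- aggregate(date ~ animals + year, data = toy, FUN = min)
first_obs
```

The result has one row per animal and year that actually occurs in the data, with the columns animals, year and date.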

Another way is to use ave inside subset. subset expects a logical condition. ave internally splits the dates by animal and by year and applies which.min to each group. We compare the output of ave (the first observation of the animal in that year) with the date, and in this way create the logical condition for subset.

subset(animal_data, ave(date, animals, substr(date, 1, 4), FUN=\(x) x[which.min(x)]) == date)
#    animals       date
# 1   Rhebok 2020-02-22
# 2   Vicuna 2020-06-23
# 3  Gorilla 2020-07-05
# 12 Gorilla 2021-01-11
# 13  Rhebok 2021-03-10
# 14  Vicuna 2021-05-24
# 19  Rhebok 2022-04-29
# 20 Gorilla 2022-05-03
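
To make the ave step easier to follow, here is a sketch that splits it into intermediate results. It uses hypothetical toy data (toy, yr, first_seen and is_first are illustrative names, not part of the answer's data) and function(x) instead of the \(x) shorthand, so it should also run on R versions before 4.1.

```r
# Hypothetical toy data, not the animal_data from this answer
toy <- data.frame(
  animals = c("A", "A", "B", "B", "A"),
  date = as.Date(c("2020-03-01", "2020-01-15", "2020-02-10",
                   "2021-06-01", "2021-04-20"))
)

yr <- format(toy$date, "%Y")

# ave returns a vector as long as the input: for every row,
# the earliest date within that row's animal/year group
first_seen <- ave(toy$date, toy$animals, yr,
                  FUN = function(x) x[which.min(x)])

# TRUE where the row itself is the earliest observation of its group
is_first <- toy$date == first_seen

toy[is_first, ]
```

Using x[which.min(x)] rather than min(x) appears to be a deliberate choice: ave also calls FUN on animal/year combinations that have no rows, where min() on an empty vector would emit a warning, while x[which.min(x)] simply returns an empty result.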

Now you probably have a few options to choose from.

Data:

animal_data <- structure(list(animals = c("Rhebok", "Vicuna", "Gorilla", "Rhebok", 
"Rhebok", "Gorilla", "Rhebok", "Vicuna", "Vicuna", "Gorilla", 
"Vicuna", "Gorilla", "Rhebok", "Vicuna", "Rhebok", "Rhebok", 
"Rhebok", "Vicuna", "Rhebok", "Gorilla"), date = structure(c(18314, 
18436, 18448, 18487, 18502, 18516, 18549, 18582, 18588, 18589, 
18604, 18638, 18696, 18771, 18806, 18807, 18911, 18938, 19111, 
19115), class = "Date")), row.names = c(8L, 15L, 9L, 18L, 3L, 
20L, 4L, 14L, 7L, 10L, 5L, 17L, 11L, 6L, 19L, 2L, 1L, 13L, 12L, 
16L), class = "data.frame")