I have the following data frame structure
date | latitude |
---|---|
1951-03-22 | 66.08106 |
1951-03-22 | 59.59117 |
1951-04-08 | 59.59117 |
1952-10-20 | 55.41972 |
1960-08-12 | 66.05653 |
1960-09-10 | 66.08106 |
What I would like to do is: select the rows for all unique latitude and if there are 2 (or more) exact same latitudes, I want to keep only the one that has the earliest date but for each year.
So, for my previous example, it would give the following subset, without only the 3rd row:
date | latitude |
---|---|
1951-03-22 | 66.08106 |
1951-03-22 | 59.59117 |
1952-10-20 | 55.41972 |
1960-08-12 | 66.05653 |
1960-09-10 | 66.08106 |
Many thanks for the help.
PS: maybe it is important to precise that class(df$date) is "Date" and class(df$latitude) is "numeric".
CodePudding user response:
Grouped by 'latitude' and year
extracted from 'date', use slice_max
to extract the row with max date, and then remove the 'year' column
library(dplyr)
library(lubridate)
df1 %>%
# grouped by latitude, and year extracted from Date class
# year is from lubridate
group_by(latitude, year = year(date)) %>%
# slice 1 row from each group, ordered by the 'date' column
slice_max(n = 1, order_by = date) %>%
# remove the grouping
ungroup %>%
# remove the year column
select(-year) %>%
arrange(date)
-output
# A tibble: 5 × 2
date latitude
<chr> <dbl>
1 1951-03-22 66.1
2 1951-04-08 59.6
3 1952-10-20 55.4
4 1960-08-12 66.1
5 1960-09-10 66.1