I have a very large dataframe (sample below) with data from multiple sources. I need to select the maximum count for each wyear at each site. Further complicating things, is there are multiple data sources that very in reliability so before I can pick the max I need to select the most accurate datasource for each wyear and then the maximum count from that datasource each wyear and site. The order of accuracy for datasource is: kate > alan > mark > ron.
wyear count site datasource
1 2018 5 A kate
2 2018 10 A alan
3 2020 20 A kate
4 2021 20 A kate
5 2021 5 A alan
6 2017 8 B mark
7 2018 89 B alan
8 2018 50 B mark
9 2019 49 B alan
10 2019 25 B mark
11 2019 35 B ron
12 2020 62 B alan
13 2018 20 C kate
14 2018 5 C mark
15 2018 8 C kate
16 2019 89 C mark
17 2020 50 C alan
18 2020 49 C mark
19 2021 25 C alan
20 2021 25 C kate
So I should end up with the following:
wyear count site datasource
1 2018 5 A kate
3 2020 20 A kate
4 2021 20 A kate
6 2017 8 B mark
7 2018 89 B alan
9 2019 49 B alan
12 2020 62 B alan
13 2018 20 C kate
16 2019 89 C mark
17 2020 50 C alan
20 2021 25 C kate
I know how I would go about getting the maximum per site and year but I don't know how to first grab based on datasource:
data1<-data %>%
group_by(site, wyear)%>%
mutate(max.count.at.site = max(count))
CodePudding user response:
We can convert your datasource
column to a factor with the levels in the order of accuracy, and then sort the data by accuracy and count within each group, and take the top row in each group.
data %>%
mutate(datasource = factor(datasource, levels = c("kate", "alan", "mark", "ron"))) %>%
group_by(site, wyear) %>%
arrange(datasource, desc(count), .by_group = TRUE) %>%
slice(1) %>%
ungroup()
# # A tibble: 11 × 4
# wyear count site datasource
# <int> <int> <chr> <fct>
# 1 2018 5 A kate
# 2 2020 20 A kate
# 3 2021 20 A kate
# 4 2017 8 B mark
# 5 2018 89 B alan
# 6 2019 49 B alan
# 7 2020 62 B alan
# 8 2018 20 C kate
# 9 2019 89 C mark
# 10 2020 50 C alan
# 11 2021 25 C kate
CodePudding user response:
If you start with prioritizing the datasource
, summarizing by year/source, you can then move on to selecting just one per year.
prio <- c("kate", "alan", "mark", "ron")
library(dplyr)
quux %>%
group_by(site, wyear, datasource) %>%
summarize(maxcount = sum(count)) %>%
arrange(wyear, match(datasource, prio, nomatch = 99)) %>%
group_by(wyear, site) %>%
slice(1) %>%
ungroup()
# # A tibble: 11 x 4
# site wyear datasource maxcount
# <chr> <int> <chr> <int>
# 1 B 2017 mark 8
# 2 A 2018 kate 5
# 3 B 2018 alan 89
# 4 C 2018 kate 28
# 5 B 2019 alan 49
# 6 C 2019 mark 89
# 7 A 2020 kate 20
# 8 B 2020 alan 62
# 9 C 2020 alan 50
# 10 A 2021 kate 20
# 11 C 2021 kate 25
Data
quux <- structure(list(wyear = c(2018L, 2018L, 2020L, 2021L, 2021L, 2017L, 2018L, 2018L, 2019L, 2019L, 2019L, 2020L, 2018L, 2018L, 2018L, 2019L, 2020L, 2020L, 2021L, 2021L), count = c(5L, 10L, 20L, 20L, 5L, 8L, 89L, 50L, 49L, 25L, 35L, 62L, 20L, 5L, 8L, 89L, 50L, 49L, 25L, 25L), site = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C"), datasource = c("kate", "alan", "kate", "kate", "alan", "mark", "alan", "mark", "alan", "mark", "ron", "alan", "kate", "mark", "kate", "mark", "alan", "mark", "alan", "kate")), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"))
CodePudding user response:
Thank you everyone. You helped me get where I needed to go. Below is the code I tweaked based on the answers that produces the desired result:
prio <- c("kate", "alan", "mark", "ron")
library(dplyr)
quux<-data %>%
group_by(site, wyear, datasource) %>%
summarize(maxcount = max(count)) %>%
arrange(site,wyear, match(datasource, prio, nomatch = 99)) %>%
group_by(wyear,site) %>%
slice(1) %>%
ungroup()
quux