Home > front end >  how to select rows based on a hierarchical order of values in another column
how to select rows based on a hierarchical order of values in another column

Time:10-07

I have a very large dataframe (sample below) with data from multiple sources. I need to select the maximum count for each wyear at each site. Further complicating things, is there are multiple data sources that very in reliability so before I can pick the max I need to select the most accurate datasource for each wyear and then the maximum count from that datasource each wyear and site. The order of accuracy for datasource is: kate > alan > mark > ron.

  wyear count site datasource
1   2018     5    A kate
2   2018    10    A alan
3   2020    20    A kate
4   2021    20    A kate
5   2021     5    A alan
6   2017     8    B mark
7   2018    89    B alan
8   2018    50    B mark
9   2019    49    B alan
10  2019    25    B mark
11  2019    35    B  ron
12  2020    62    B alan
13  2018    20    C kate
14  2018     5    C mark
15  2018     8    C kate
16  2019    89    C mark
17  2020    50    C alan
18  2020    49    C mark
19  2021    25    C alan
20  2021    25    C kate

So I should end up with the following:

wyear count site datasource
1   2018     5    A kate
3   2020    20    A kate
4   2021    20    A kate
6   2017     8    B mark
7   2018    89    B alan
9   2019    49    B alan
12  2020    62    B alan
13  2018    20    C kate
16  2019    89    C mark
17  2020    50    C alan
20  2021    25    C kate

I know how I would go about getting the maximum per site and year but I don't know how to first grab based on datasource:

data1<-data %>%
  group_by(site, wyear)%>%
  mutate(max.count.at.site = max(count))

CodePudding user response:

We can convert your datasource column to a factor with the levels in the order of accuracy, and then sort the data by accuracy and count within each group, and take the top row in each group.

data %>%
  mutate(datasource = factor(datasource, levels = c("kate", "alan", "mark", "ron"))) %>%
  group_by(site, wyear) %>%
  arrange(datasource, desc(count), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
# # A tibble: 11 × 4
#    wyear count site  datasource
#    <int> <int> <chr> <fct>     
#  1  2018     5 A     kate      
#  2  2020    20 A     kate      
#  3  2021    20 A     kate      
#  4  2017     8 B     mark      
#  5  2018    89 B     alan      
#  6  2019    49 B     alan      
#  7  2020    62 B     alan      
#  8  2018    20 C     kate      
#  9  2019    89 C     mark      
# 10  2020    50 C     alan      
# 11  2021    25 C     kate      

CodePudding user response:

If you start with prioritizing the datasource, summarizing by year/source, you can then move on to selecting just one per year.

prio <- c("kate", "alan", "mark", "ron")
library(dplyr)
quux %>%
  group_by(site, wyear, datasource) %>%
  summarize(maxcount = sum(count)) %>%
  arrange(wyear, match(datasource, prio, nomatch = 99)) %>%
  group_by(wyear, site) %>%
  slice(1) %>%
  ungroup()
# # A tibble: 11 x 4
#    site  wyear datasource maxcount
#    <chr> <int> <chr>         <int>
#  1 B      2017 mark              8
#  2 A      2018 kate              5
#  3 B      2018 alan             89
#  4 C      2018 kate             28
#  5 B      2019 alan             49
#  6 C      2019 mark             89
#  7 A      2020 kate             20
#  8 B      2020 alan             62
#  9 C      2020 alan             50
# 10 A      2021 kate             20
# 11 C      2021 kate             25



Data

quux <- structure(list(wyear = c(2018L, 2018L, 2020L, 2021L, 2021L, 2017L, 2018L, 2018L, 2019L, 2019L, 2019L, 2020L, 2018L, 2018L, 2018L, 2019L, 2020L, 2020L, 2021L, 2021L), count = c(5L, 10L, 20L, 20L, 5L, 8L, 89L, 50L, 49L, 25L, 35L, 62L, 20L, 5L, 8L, 89L, 50L, 49L, 25L, 25L), site = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C"), datasource = c("kate", "alan", "kate", "kate", "alan", "mark", "alan", "mark", "alan", "mark", "ron", "alan", "kate", "mark",  "kate", "mark", "alan", "mark", "alan", "kate")), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"))

CodePudding user response:

Thank you everyone. You helped me get where I needed to go. Below is the code I tweaked based on the answers that produces the desired result:

prio <- c("kate", "alan", "mark", "ron")
library(dplyr)
quux<-data %>%
  group_by(site, wyear, datasource) %>%
  summarize(maxcount = max(count)) %>%
  arrange(site,wyear, match(datasource, prio, nomatch = 99)) %>%
  group_by(wyear,site) %>%
  slice(1) %>%
  ungroup()
quux
  • Related