I have a dataframe of the form:
ID | Rank | Var1 | Var2 |
---|---|---|---|
a | 3 | na | 6 |
b | 2 | 3 | na |
a | 1 | na | 5 |
a | 2 | 2 | 1 |
b | 1 | 7 | 1 |
I am trying to apply a sequence of operations:
- Group by ID column
- Sort by Rank Column descending
- For each variable column, select the highest ranked non na value So the output would be
ID | Var1 | Var 2 |
---|---|---|
a | 2 | 6 |
b | 3 | 1 |
So far I have
df %>% group_by(ID) %>% arange(desc(Rank))
But am unsure how to then filter each column for non NA and select the highest. The NA filter should be applied on a column basis only - if Var 1 has na, it should be excluded from the calculation of Var 1, but not for Var2.
CodePudding user response:
Select the first non-NA value across the selected variables after grouping:
library(dplyr)
df %>%
na_if("na") %>%
group_by(ID) %>%
arrange(desc(Rank), .by_group = T) %>%
summarise(across(Var1:Var2, ~ head(.x[!is.na(.x)], 1)))
# A tibble: 2 × 3
ID Var1 Var2
<chr> <chr> <chr>
1 a 2 6
2 b 3 1
or with first
:
df %>%
na_if("na") %>%
group_by(ID) %>%
summarise(across(Var1:Var2, ~ first(.x[!is.na(.x)], order_by = "Rank")))
data
df <- read.table(header = T, text = "ID Rank Var1 Var2
a 3 na 6
b 2 3 na
a 1 na 5
a 2 2 1
b 1 7 1")
CodePudding user response:
library(tidyverse)
tribble(
~ID, ~Rank, ~Var1, ~Var2,
"a", 3, NA, 6,
"b", 2, 3, NA,
"a", 1, NA, 5,
"a", 2, 2, 1,
"b", 1, 7, 1
) |>
group_by(ID) |>
arrange(ID, desc(Rank)) |>
fill(everything(), .direction = "up") |>
summarise(across(starts_with("Var"), first))
#> # A tibble: 2 × 3
#> ID Var1 Var2
#> <chr> <dbl> <dbl>
#> 1 a 2 6
#> 2 b 3 1
Created on 2022-05-11 by the reprex package (v2.0.1)