My problem seems rather trivial but appearantly I didn't come up with the appropriate search terms.
My data is like this:
data <- data.frame(ID = c(1,1,2,3,3),
V1 = c("A","B","A","B","C"),
V2 = c("C","B",NA,"B","A"),
V3 = c("A","B","C","B",NA))
I want to coalesce or merge the rows by ID and keep only one row per ID with the "highest" value in each column. In my example, I would like to prioritize C over B over A.
After the desired operation, my data would look like this:
| ID | V1 | V2 | V3 |
| -- | -- | -- | -- |
| 1 | B | C | B |
| 2 | A | NA | C |
| 3 | C | B | B |
Any hints would be greatly appreciated! Dplyr is preferred, but no necessity. Thanks!
Edit: The solutions (thank you for that!) all took advantage of the fact that letters are "ordered" in R.
Let us take this example data:
data <- data.frame(ID = c(1,1,2,3,3),
V1 = c("yes","no","yes","no","unsure"),
V2 = c("unsure","no",NA,"no","yes"),
V3 = c("yes","no","unsure","no",NA))
Where the desired outcome is that the priority is "yes" over "no" over "unsure".
CodePudding user response:
Since we can can get the maximum for a letter from the alphabet, We could use:
library(tidyverse)
data %>%
group_by(ID) %>%
summarize(across(everything(), ~ max(., na.rm = TRUE)))
Which gives:
# A tibble: 3 x 4
ID V1 V2 V3
<dbl> <chr> <chr> <chr>
1 1 B C B
2 2 A <NA> C
3 3 C B B
CodePudding user response:
Here's a solution in base
:
aggregate(data[,-1], by = list(ID=data$ID), FUN = max, na.rm = T)
# ID V1 V2 V3
# 1 1 B C B
# 2 2 A <NA> C
# 3 3 C B B
CodePudding user response:
EDIT: added simpler dplyr-only
library(dplyr)
data %>%
group_by(ID) %>%
summarize(across(V1:V3, max))
# A tibble: 3 × 4
ID V1 V2 V3
<dbl> <chr> <chr> <chr>
1 1 B C B
2 2 A NA C
3 3 C B NA
Earlier solution using reshaping from tidyr.
library(dplyr); library(tidyr)
data %>%
pivot_longer(-ID) %>%
group_by(ID, name) %>%
slice_max(value) %>%
ungroup() %>%
pivot_wider(names_from = name)
ID V1 V2 V3
<dbl> <chr> <chr> <chr>
1 1 B C B
2 2 A NA C
3 3 C B B
If you want ordered factors, here's an approach where we specify the ordering, apply that to the data in V1:V3, and then proceed as before.
data <- data.frame(ID = c(1,1,2,3,3),
V1 = c("yes","no","yes","no","unsure"),
V2 = c("unsure","no",NA,"no","yes"),
V3 = c("yes","no","unsure","no",NA))
var_order <- c("yes", "no", "unsure")
data %>%
mutate(across(V1:V3, ~factor(.x, levels = var_order))) %>%
pivot_longer(-ID) %>%
group_by(ID, name) %>%
slice_min(value) %>%
ungroup() %>%
pivot_wider(names_from = name)
# A tibble: 3 × 4
ID V1 V2 V3
<dbl> <fct> <fct> <fct>
1 1 yes no yes
2 2 yes NA unsure
3 3 no yes no