Home > OS >  Coalescing/merging rows but retaining "dominant" values
Coalescing/merging rows but retaining "dominant" values

Time:11-11

My problem seems rather trivial but appearantly I didn't come up with the appropriate search terms.

My data is like this:

data <- data.frame(ID = c(1,1,2,3,3),
                   V1 = c("A","B","A","B","C"),
                   V2 = c("C","B",NA,"B","A"),
                   V3 = c("A","B","C","B",NA))

I want to coalesce or merge the rows by ID and keep only one row per ID with the "highest" value in each column. In my example, I would like to prioritize C over B over A.

After the desired operation, my data would look like this:

| ID | V1 | V2 | V3 |
| -- | -- | -- | -- |
|  1 | B  | C  | B  |
|  2 | A  | NA | C  |
|  3 | C  | B  | B  |

Any hints would be greatly appreciated! Dplyr is preferred, but no necessity. Thanks!

Edit: The solutions (thank you for that!) all took advantage of the fact that letters are "ordered" in R.

Let us take this example data:

data <- data.frame(ID = c(1,1,2,3,3),
                   V1 = c("yes","no","yes","no","unsure"),
                   V2 = c("unsure","no",NA,"no","yes"),
                   V3 = c("yes","no","unsure","no",NA))

Where the desired outcome is that the priority is "yes" over "no" over "unsure".

CodePudding user response:

Since we can can get the maximum for a letter from the alphabet, We could use:

library(tidyverse)

data %>%
  group_by(ID) %>%
  summarize(across(everything(), ~ max(., na.rm = TRUE)))

Which gives:

# A tibble: 3 x 4
     ID V1    V2    V3   
  <dbl> <chr> <chr> <chr>
1     1 B     C     B    
2     2 A     <NA>  C    
3     3 C     B     B

CodePudding user response:

Here's a solution in base:

aggregate(data[,-1], by = list(ID=data$ID), FUN = max, na.rm = T)

#   ID V1   V2 V3
# 1  1  B    C  B
# 2  2  A <NA>  C
# 3  3  C    B  B

CodePudding user response:

EDIT: added simpler dplyr-only

library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(across(V1:V3, max))

# A tibble: 3 × 4
     ID V1    V2    V3   
  <dbl> <chr> <chr> <chr>
1     1 B     C     B    
2     2 A     NA    C    
3     3 C     B     NA   

Earlier solution using reshaping from tidyr.

library(dplyr); library(tidyr)
data %>%
  pivot_longer(-ID) %>%
  group_by(ID, name) %>%
  slice_max(value) %>%
  ungroup() %>% 
  pivot_wider(names_from = name)


     ID V1    V2    V3   
  <dbl> <chr> <chr> <chr>
1     1 B     C     B    
2     2 A     NA    C    
3     3 C     B     B  

If you want ordered factors, here's an approach where we specify the ordering, apply that to the data in V1:V3, and then proceed as before.

data <- data.frame(ID = c(1,1,2,3,3),
                   V1 = c("yes","no","yes","no","unsure"),
                   V2 = c("unsure","no",NA,"no","yes"),
                   V3 = c("yes","no","unsure","no",NA))

var_order <- c("yes", "no", "unsure")
data %>%
  mutate(across(V1:V3, ~factor(.x, levels = var_order))) %>%
  pivot_longer(-ID) %>%
  group_by(ID, name) %>%
  slice_min(value) %>%
  ungroup() %>% 
  pivot_wider(names_from = name)


# A tibble: 3 × 4
     ID V1    V2    V3    
  <dbl> <fct> <fct> <fct> 
1     1 yes   no    yes   
2     2 yes   NA    unsure
3     3 no    yes   no   
  • Related