Home > Software design >  R: Combine rows based on equal values in several columns
R: Combine rows based on equal values in several columns

Time:06-22

I have a dataframe that takes similar form as the toy dataframe below. I would like to merge the rows if var1, var2, and var3 are all equal values, creating a combination of data in the merged rows. For rows 4 - 6, where there are different values in the rows, I was wondering if there is a way to put them in the same column with a seperator in between.

df <- data.frame(var1 = c("1635", "1635", "1729", "1847", "1847", "1847"),
                 var2 = c("Aa", "Aa", "Bb", "Cc", "Cc", "Cc"), 
                 var3 = c("28", "28", "85", "27", "27", "27"), 
                 var4 = c("apple", NA, "orange", "pear", NA, NA),
                 var5 = c(NA, "tree", NA, NA, "ground", "desk")
                 )

So the output would look something like this:

enter image description here

CodePudding user response:

in base R you would do:

aggregate(.~var1 var2 var3, df, \(x)toString(unique(na.omit(x))), na.action = identity)

  var1 var2 var3   var4         var5
1 1847   Cc   27   pear ground, desk
2 1635   Aa   28  apple         tree
3 1729   Bb   85 orange 

in tidyverse:

library(tidyverse)
df %>%
  group_by(var1,var2,var3) %>%
  summarize(across(var4:var5, ~toString(unique(na.omit(.x)))),.groups = 'drop')

# Groups:   var1, var2 [3]
  var1  var2  var3  var4   var5          
  <chr> <chr> <chr> <chr>  <chr>         
1 1635  Aa    28    apple  "tree"        
2 1729  Bb    85    orange ""            
3 1847  Cc    27    pear   "ground, desk"

CodePudding user response:

With dplyr, you can group_by the three columns, then use summarize to concatenate the strings if they are not NA.

library(dplyr)

df %>% 
  group_by(var1, var2, var3) %>% 
  summarize(across(var4:var5, ~ifelse(all(is.na(.x)), NA, paste0(na.omit(.x), collapse = ","))), .groups = "drop")

# A tibble: 3 × 5
  var1  var2  var3  var4   var5       
  <chr> <chr> <chr> <chr>  <chr>      
1 1635  Aa    28    apple  tree       
2 1729  Bb    85    orange NA         
3 1847  Cc    27    pear   ground,desk

CodePudding user response:

With data.table:

setDT(df)
df[, 
  lapply(.SD, \(x) if (all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = "; ")), 
  by = var1:var3]
setDF(df)
#      var1   var2   var3   var4         var5
#    <char> <char> <char> <char>       <char>
# 1:   1635     Aa     28  apple         tree
# 2:   1729     Bb     85 orange         <NA>
# 3:   1847     Cc     27   pear ground; desk
  • Related