Home > OS >  Efficient way of merging and grouping columns in dataframe to create summary column
Efficient way of merging and grouping columns in dataframe to create summary column

Time:09-18

I have a following dataframe:

df1 <- structure(list(group = c("KO", "WT", "KO", "KO", "KO", "KO", 
"WT", "KO", "KO", "WT", "WT", "WT", "WT", "WT", "WT", "WT", "WT", 
"WT", "WT", "KO", "KO"), name = c("rike", "rabe", "smake", "rike", 
"rike", "rike", "rabe", "rike", "rike", "due", "rabe", "ene", 
"ene", "due", "ene", "rabe", "due", "rabe", "due", "smake", "kum"
), type = c("C", "A", "A", "A", "C", "B", "A", "B", "B", "A", 
"B", "A", "C", "C", "C", "C", "B", "C", "A", "C", "A"), posit = c(10, 
2, 21, 5, 12, 22, 18, 19, 81, 22, 33, 31, 80, 40, 16, 16, 7, 
9, 26, 27, 7)), row.names = c(NA, -21L), class = "data.frame")

I would like to combine 2 columns, one character ("type") and one numeric ("posit") in that manner, that all of categories (letters) would be joined with corresponding posits (numbers), for instance "A" and "37" as "A37" and all of the type-posit pairs for given "name" would be pasted in new column in ascending order (from lesser to bigger values). Also I'd like them to be separated with ":". The desired output is given below:

df2 <-structure(list(group = c("WT", "WT", "WT", "KO", "KO", "KO"), 
    name = c("ene", "due", "rabe", "kum", "rike", "smake"), type_posit = c("C16:A31:C80", 
    "B7:A22:A26:C40", "A2:C9:C16:A18:B33", "A7", "A5:C10:C12:B19:B22:B81", 
    "A21:C27")), class = "data.frame", row.names = c(NA, -6L))

I can achieve this by using set of dplyr functions, and creating intermediate dataframes, like this:

df2 <- df1 %>% 
  dplyr::mutate(t_p = paste0(type,posit)) %>% 
  dplyr::arrange(name,posit) %>% 
  dplyr::select(-type, -posit) %>% 
  dplyr::group_by(group, name) %>% 
  dplyr::summarise(tag_pos =paste0(t_p, collapse = ":"))

However I wonder, whether there is more efficient and/or cleaner way to do so? I would like to write a clean, understandable code.

CodePudding user response:

Using aggregate basically.

aggregate(paste0(type, posit) ~ group   name, df1, paste, collapse=':')
#   group  name    paste0(type, posit)
# 1    WT   due         A22:C40:B7:A26
# 2    WT   ene            A31:C80:C16
# 3    KO   kum                     A7
# 4    WT  rabe      A2:A18:B33:C16:C9
# 5    KO  rike C10:A5:C12:B22:B19:B81
# 6    KO smake                A21:C27

Sorted version:

aggregate(paste0(type, posit) ~ group   name, df1, \(x) paste(sort(x), collapse=':'))
#   group  name    paste0(type, posit)
# 1    WT   due         A22:A26:B7:C40
# 2    WT   ene            A31:C16:C80
# 3    KO   kum                     A7
# 4    WT  rabe      A18:A2:B33:C16:C9
# 5    KO  rike A5:B19:B22:B81:C10:C12
# 6    KO smake                A21:C27

CodePudding user response:

You need to sort the rows by posit and then summarise each group with paste0(..., collapse = ':').

library(dplyr)

df1 %>%
  group_by(group, name) %>%
  arrange(posit, .by_group = TRUE) %>% 
  summarise(type_posit = paste0(type, posit, collapse = ':'), .groups = 'drop')

# # A tibble: 6 × 3
#   group name  type_posit            
#   <chr> <chr> <chr>                 
# 1 KO    kum   A7                    
# 2 KO    rike  A5:C10:C12:B19:B22:B81
# 3 KO    smake A21:C27               
# 4 WT    due   B7:A22:A26:C40        
# 5 WT    ene   C16:A31:C80           
# 6 WT    rabe  A2:C9:C16:A18:B33

CodePudding user response:

Using data.table

library(data.table)
setDT(df1)[order(posit), .(type_posit = paste(type, sep = "",
    posit, collapse = ":")),.(group, name)]

-output

 group   name             type_posit
   <char> <char>                 <char>
1:     WT   rabe      A2:C9:C16:A18:B33
2:     KO   rike A5:C10:C12:B19:B22:B81
3:     WT    due         B7:A22:A26:C40
4:     KO    kum                     A7
5:     WT    ene            C16:A31:C80
6:     KO  smake                A21:C27
  • Related