Convert duplicate rows to one row with with some of the values collapsed into a single column separa-CodePudding

I have a table like this:

 test <- data.frame(chr=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), start=c(1,1,1,2,2,10), end=c(5,5,5,7,7,20), gene=c("g1", "g1", "g1", "g2", "g2", "g3"), chrQ=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), startq=c(1,1,1,2,3, 10), endq=c(5,5,6,7,7, 20), geneq=c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q"))

> test
   chr start end gene chrQ startq endq geneq
1 chr1     1   5   g1 chr1      1    5   g1q
2 chr1     1   5   g1 chr1      1    5   g2q
3 chr1     1   5   g1 chr1      1    6   g3q
4 chr2     2   7   g2 chr2      2    7   g4q
5 chr2     2   7   g2 chr2      3    7   g5q
6 chr2    10  20   g3 chr2     10   20   g6q

I would like to remove duplicated rows based on the column gene. And collapse the values of the columns named in this example: chrQ, startq, endq, geneq

I would like to convert that table to this

    chr start end gene matched                                         matched_total
 1 chr1     1   5   g1    chr1 1 5 g1q; g1 chr1 1 5 g2q; chr1 1 6 g3q              3
 2 chr2     2   7   g2    chr2 2 7 g4q; chr2 3 7 g5q                               2
 3 chr2    10  20   g3    chr2 10 20 g6q                                           1

I would like to add a column called matched, that contains the mentioned columns in a single row separated by ; or any other character and the count of duplicated rows to the column matched_total.

I know that I could eliminate dulplicated columns like this

test %>% distinct(gene, .keep_all = TRUE)

and that I could add the counts with something like this:

test_s <- test  %>%  group_by(gene)  %>% summarize(Total=n())

Using the package dplyr, but I don´t know how to collapse the other columns. Could you please tell me how I could achieve this?

CodePudding user response：

You can't use distinct, because then you will lose data needed to create the column matched. Use summary instead to collapse data from all rows belonging to one gene:

library(tidyverse)

test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

test %>%
  group_by(chr, start, end, gene) %>%
  unite("matched", chrQ, startq, endq, geneq, sep = " ") %>%
  summarise(
    matched = matched %>% paste0(collapse = "; "),
    matched_total = n()
  )
#> `summarise()` has grouped output by 'chr', 'start', 'end'. You can override
#> using the `.groups` argument.
#> # A tibble: 3 × 6
#> # Groups:   chr, start, end [3]
#>   chr   start   end gene  matched                                  matched_total
#>   <chr> <dbl> <dbl> <chr> <chr>                                            <int>
#> 1 chr1      1     5 g1    chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q             3
#> 2 chr2      2     7 g2    chr2 2 7 g4q; chr2 3 7 g5q                           2
#> 3 chr2     10    20 g3    chr2 10 20 g6q                                       1

^{Created on 2022-04-01 by the reprex package (v2.0.0)}

CodePudding user response：

Another possible solution:

library(tidyverse)

test %>% 
  mutate(across(where(is.numeric), as.character)) %>% 
  rowwise %>% 
  mutate(matched = str_c(c_across(chrQ:geneq), collapse = " ")) %>% 
  group_by(gene) %>% 
  summarise(matched = str_c(matched, collapse = "; "), matched_total = n())

#> # A tibble: 3 × 3
#>   gene  matched                                  matched_total
#>   <chr> <chr>                                            <int>
#> 1 g1    chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q             3
#> 2 g2    chr2 2 7 g4q; chr2 3 7 g5q                           2
#> 3 g3    chr2 10 20 g6q                                       1