Convert duplicate rows to one row with some of the values collapsed into a single column separa

Time:04-01

I have a table like this:

 test <- data.frame(chr=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), start=c(1,1,1,2,2,10), end=c(5,5,5,7,7,20), gene=c("g1", "g1", "g1", "g2", "g2", "g3"), chrQ=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), startq=c(1,1,1,2,3, 10), endq=c(5,5,6,7,7, 20), geneq=c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q"))

> test
   chr start end gene chrQ startq endq geneq
1 chr1     1   5   g1 chr1      1    5   g1q
2 chr1     1   5   g1 chr1      1    5   g2q
3 chr1     1   5   g1 chr1      1    6   g3q
4 chr2     2   7   g2 chr2      2    7   g4q
5 chr2     2   7   g2 chr2      3    7   g5q
6 chr2    10  20   g3 chr2     10   20   g6q

I would like to remove duplicated rows based on the column gene and collapse the values of the other columns, in this example chrQ, startq, endq and geneq.

I would like to convert that table to this:

    chr start end gene matched                                         matched_total
 1 chr1     1   5   g1    chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q                 3
 2 chr2     2   7   g2    chr2 2 7 g4q; chr2 3 7 g5q                               2
 3 chr2    10  20   g3    chr2 10 20 g6q                                           1

I would like to add a column called matched that contains the mentioned columns collapsed into a single value per gene, separated by ; or any other character, and a column called matched_total that holds the count of duplicated rows.

I know that I could eliminate duplicated rows like this:

test %>% distinct(gene, .keep_all = TRUE)

and that I could add the counts with something like this:

test_s <- test  %>%  group_by(gene)  %>% summarize(Total=n())

Both snippets use the dplyr package, but I don't know how to collapse the other columns. Could you please tell me how I could achieve this?

CodePudding user response:

You can't use distinct(), because it discards the rows you need to build the matched column. Use summarise() instead to collapse the data from all rows belonging to one gene:

library(tidyverse)

test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

test %>%
  group_by(chr, start, end, gene) %>%
  unite("matched", chrQ, startq, endq, geneq, sep = " ") %>%
  summarise(
    matched = matched %>% paste0(collapse = "; "),
    matched_total = n()
  )
#> `summarise()` has grouped output by 'chr', 'start', 'end'. You can override
#> using the `.groups` argument.
#> # A tibble: 3 × 6
#> # Groups:   chr, start, end [3]
#>   chr   start   end gene  matched                                  matched_total
#>   <chr> <dbl> <dbl> <chr> <chr>                                            <int>
#> 1 chr1      1     5 g1    chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q             3
#> 2 chr2      2     7 g2    chr2 2 7 g4q; chr2 3 7 g5q                           2
#> 3 chr2     10    20 g3    chr2 10 20 g6q                                       1

Created on 2022-04-01 by the reprex package (v2.0.0)
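
The message in the output above says the result is still grouped by chr, start and end. A minimal sketch of a variant that returns an ungrouped tibble, assuming dplyr 1.0.0 or later: it passes `.groups = "drop"` to summarise() and builds the matched string with base paste() rather than unite(). The name `result` is only illustrative.

```r
library(dplyr)

test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

result <- test %>%
  # Join the four query columns of each row into one space-separated string
  mutate(matched = paste(chrQ, startq, endq, geneq)) %>%
  group_by(chr, start, end, gene) %>%
  summarise(
    # Collapse the per-row strings of each group into one value
    matched = paste(matched, collapse = "; "),
    # Count the rows that were collapsed
    matched_total = n(),
    # Drop all grouping so the result is a plain tibble
    .groups = "drop"
  )

result
```

Dropping the grouping mainly matters if more dplyr verbs follow, since they would otherwise keep operating per group.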

CodePudding user response:

Another possible solution:

library(tidyverse)

test %>% 
  mutate(across(where(is.numeric), as.character)) %>% 
  rowwise %>% 
  mutate(matched = str_c(c_across(chrQ:geneq), collapse = " ")) %>% 
  group_by(gene) %>% 
  summarise(matched = str_c(matched, collapse = "; "), matched_total = n())

#> # A tibble: 3 × 3
#>   gene  matched                                  matched_total
#>   <chr> <chr>                                            <int>
#> 1 g1    chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q             3
#> 2 g2    chr2 2 7 g4q; chr2 3 7 g5q                           2
#> 3 g3    chr2 10 20 g6q                                       1
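
Because this answer groups by gene only, the chr, start and end columns from the desired output are dropped. A possible sketch that keeps them, assuming those three columns are constant within each gene (true for the example data, not checked by the code) and dplyr 1.0.0 or later: it takes the first value of each inside summarise() and moves gene back to fourth position with relocate(). The name `result` is only illustrative.

```r
library(dplyr)

test <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(1, 1, 1, 2, 2, 10),
  end = c(5, 5, 5, 7, 7, 20),
  gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
  chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
  startq = c(1, 1, 1, 2, 3, 10),
  endq = c(5, 5, 6, 7, 7, 20),
  geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)

result <- test %>%
  # Build the per-row string first; paste() is vectorised, so rowwise() is not needed
  mutate(matched = paste(chrQ, startq, endq, geneq)) %>%
  group_by(gene) %>%
  summarise(
    # Assumption: chr, start and end do not vary within a gene, so the first value represents the group
    across(c(chr, start, end), first),
    matched = paste(matched, collapse = "; "),
    matched_total = n()
  ) %>%
  # Restore the column order from the question: chr, start, end, gene, ...
  relocate(gene, .after = end)

result
```

If chr, start or end can differ within a gene, grouping by all four columns is the safer choice, since first() would silently hide the extra values.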