I have a table like this:
test <- data.frame(chr=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), start=c(1,1,1,2,2,10), end=c(5,5,5,7,7,20), gene=c("g1", "g1", "g1", "g2", "g2", "g3"), chrQ=c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"), startq=c(1,1,1,2,3, 10), endq=c(5,5,6,7,7, 20), geneq=c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q"))
> test
chr start end gene chrQ startq endq geneq
1 chr1 1 5 g1 chr1 1 5 g1q
2 chr1 1 5 g1 chr1 1 5 g2q
3 chr1 1 5 g1 chr1 1 6 g3q
4 chr2 2 7 g2 chr2 2 7 g4q
5 chr2 2 7 g2 chr2 3 7 g5q
6 chr2 10 20 g3 chr2 10 20 g6q
I would like to remove duplicated rows based on the column gene. And collapse the values of the columns named in this example: chrQ, startq, endq, geneq
I would like to convert that table to this
chr start end gene matched matched_total
1 chr1 1 5 g1 chr1 1 5 g1q; g1 chr1 1 5 g2q; chr1 1 6 g3q 3
2 chr2 2 7 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
3 chr2 10 20 g3 chr2 10 20 g6q 1
I would like to add a column called matched, that contains the mentioned columns in a single row separated by ; or any other character and the count of duplicated rows to the column matched_total.
I know that I could eliminate dulplicated columns like this
test %>% distinct(gene, .keep_all = TRUE)
and that I could add the counts with something like this:
test_s <- test %>% group_by(gene) %>% summarize(Total=n())
Using the package dplyr, but I don´t know how to collapse the other columns. Could you please tell me how I could achieve this?
CodePudding user response:
You can't use distinct
, because then you will lose data needed to create the column matched
. Use summary
instead to collapse data from all rows belonging to one gene:
library(tidyverse)
test <- data.frame(
chr = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
start = c(1, 1, 1, 2, 2, 10),
end = c(5, 5, 5, 7, 7, 20),
gene = c("g1", "g1", "g1", "g2", "g2", "g3"),
chrQ = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
startq = c(1, 1, 1, 2, 3, 10),
endq = c(5, 5, 6, 7, 7, 20),
geneq = c("g1q", "g2q", "g3q", "g4q", "g5q", "g6q")
)
test %>%
group_by(chr, start, end, gene) %>%
unite("matched", chrQ, startq, endq, geneq, sep = " ") %>%
summarise(
matched = matched %>% paste0(collapse = "; "),
matched_total = n()
)
#> `summarise()` has grouped output by 'chr', 'start', 'end'. You can override
#> using the `.groups` argument.
#> # A tibble: 3 × 6
#> # Groups: chr, start, end [3]
#> chr start end gene matched matched_total
#> <chr> <dbl> <dbl> <chr> <chr> <int>
#> 1 chr1 1 5 g1 chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q 3
#> 2 chr2 2 7 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
#> 3 chr2 10 20 g3 chr2 10 20 g6q 1
Created on 2022-04-01 by the reprex package (v2.0.0)
CodePudding user response:
Another possible solution:
library(tidyverse)
test %>%
mutate(across(where(is.numeric), as.character)) %>%
rowwise %>%
mutate(matched = str_c(c_across(chrQ:geneq), collapse = " ")) %>%
group_by(gene) %>%
summarise(matched = str_c(matched, collapse = "; "), matched_total = n())
#> # A tibble: 3 × 3
#> gene matched matched_total
#> <chr> <chr> <int>
#> 1 g1 chr1 1 5 g1q; chr1 1 5 g2q; chr1 1 6 g3q 3
#> 2 g2 chr2 2 7 g4q; chr2 3 7 g5q 2
#> 3 g3 chr2 10 20 g6q 1