I have the following data set of Ensembl genes annotated with biotypes, and some code that counts the number of genes per biotype.
genes <- c("ENSG01","ENSG02","ENSG03","ENSG04","ENSG05")
biotype <- c("protein_coding","protein_coding","protein_coding","lncRNA","lncRNA")
data <- data.frame(genes, biotype)
data
genes biotype
1 ENSG01 protein_coding
2 ENSG02 protein_coding
3 ENSG03 protein_coding
4 ENSG04 lncRNA
5 ENSG05 lncRNA
library(dplyr)
data_cts <- data %>%
  group_by(biotype) %>%
  dplyr::count()
data_cts
# A tibble: 2 × 2
# Groups: biotype [2]
biotype n
<chr> <int>
1 lncRNA 2
2 protein_coding 3
How can I retain the Ensembl gene IDs of the counted genes in a new column, as shown below?
ENSEMBL <- c("ENSG04/ENSG05","ENSG01/ENSG02/ENSG03")
data_genes <- data.frame(data_cts, ENSEMBL)
data_genes
biotype n ENSEMBL
1 lncRNA 2 ENSG04/ENSG05
2 protein_coding 3 ENSG01/ENSG02/ENSG03
Thanks in advance
CodePudding user response:
Update (many thanks to @langtang): the code can be shortened by counting and collapsing the gene IDs in a single summarize() call:
library(dplyr)
data %>%
  group_by(biotype) %>%
  summarize(n = n(), ENSEMBL = paste0(genes, collapse = "/"))
Original answer: group by biotype, add the per-group count with add_count(), then group by both biotype and n and collapse the gene IDs with summarise():
library(dplyr)
data %>%
  group_by(biotype) %>%
  add_count() %>%
  group_by(biotype, n) %>%
  summarise(ENSEMBL = paste(genes, collapse = "/")) %>%
  ungroup()
biotype n ENSEMBL
<chr> <int> <chr>
1 lncRNA 2 ENSG04/ENSG05
2 protein_coding 3 ENSG01/ENSG02/ENSG03
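If dplyr is not available, the same count-and-collapse idea can be sketched in base R. This is an alternative that is not part of the answer above: it assumes the example data from the question, and the intermediate names n_by_type and ids_by_type are made up for illustration.

```r
# Rebuild the example data from the question
genes <- c("ENSG01", "ENSG02", "ENSG03", "ENSG04", "ENSG05")
biotype <- c("protein_coding", "protein_coding", "protein_coding", "lncRNA", "lncRNA")
data <- data.frame(genes, biotype)

# Count the genes in each biotype (n_by_type is a hypothetical helper name)
n_by_type <- aggregate(genes ~ biotype, data = data, FUN = length)
names(n_by_type)[2] <- "n"

# Collapse the gene IDs of each biotype into one "/"-separated string;
# the extra collapse argument is passed through to paste()
ids_by_type <- aggregate(genes ~ biotype, data = data, FUN = paste, collapse = "/")
names(ids_by_type)[2] <- "ENSEMBL"

# Join the counts and the collapsed IDs on biotype
data_genes <- merge(n_by_type, ids_by_type, by = "biotype")
data_genes
```

This should give the same three columns (biotype, n, ENSEMBL) as the dplyr versions, at the cost of two passes over the data and a merge, so the single summarize() call is probably still the more convenient option when dplyr is already loaded.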