How to create a vector list in a dataframe where there is repetition of values

I have a dataframe like this:

	pathways	genes
1	REACTOME_2_LTR_CIRCLE_FORMATION	ENSG00000175334
2	REACTOME_A_TETRASACCHARIDE_LINKER_SEQUENCE_IS_REQUIRED_FOR_GAG_SYNTHESIS	ENSG00000109956
3	REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT	ENSG00000072849
5	REACTOME_CELL_CYCLE	ENSG00000196230
12	REACTOME_CELL_CYCLE	ENSG00000101162
13	REACTOME_CELL_CYCLE	ENSG00000137267

I would like to create a vector c() of all the pathways for a single gene. I tried group_by() in dplyr, but it is not working.

sub_pathway=sub_path%>%
  group_by(genes)%>%
  summarise(n())

It gives me just the count. If I use only summarise(), it gives me only the genes column. I also tried a loop, but it has been running since yesterday.

structure(list(pathways = c("REACTOME_2_LTR_CIRCLE_FORMATION", "REACTOME_A_TETRASACCHARIDE_LINKER_SEQUENCE_IS_REQUIRED_FOR_GAG_SYNTHESIS", "REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT", "REACTOME_CELL_CYCLE", "REACTOME_CELL_CYCLE", "REACTOME_CELL_CYCLE"), genes = c("ENSG00000175334", "ENSG00000109956", "ENSG00000072849", "ENSG00000196230", "ENSG00000101162", "ENSG00000137267")), row.names = c(1L, 2L, 3L, 5L, 12L, 13L), class = "data.frame")

CodePudding user response:

Since your data is in a data.frame, you cannot store all pathways for a gene as a single vector inside an ordinary column. Each column of the table is itself a vector, and atomic vectors in R are flat: they cannot be nested.

However, you can use list columns: lists are R’s way of nesting structures. Therefore, the following works:

sub_pathway = sub_path %>%
    group_by(genes) %>%
    summarise(pathways = list(pathways))

The result is a table with one row per gene, and the pathways column is a list with, for each row, one vector of pathways.
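
To make the shape of that result concrete, here is a minimal sketch on toy data. The data frame `toy` and the names `geneA`, `geneB`, `PATHWAY_1`, `PATHWAY_2`, `by_gene` and `lookup` are made up for illustration and are not taken from the question. It assumes dplyr is installed.

```r
library(dplyr)

# Toy data (hypothetical names): geneA belongs to two pathways, geneB to one.
toy <- data.frame(
  pathways = c("PATHWAY_1", "PATHWAY_2", "PATHWAY_2"),
  genes    = c("geneA", "geneA", "geneB"),
  stringsAsFactors = FALSE
)

# One row per gene; `pathways` becomes a list column.
by_gene <- toy %>%
  group_by(genes) %>%
  summarise(pathways = list(pathways))

# Each element of the list column is an ordinary character vector.
by_gene$pathways[[1]]

# Naming the list by gene allows lookup of one gene's pathways.
lookup <- setNames(by_gene$pathways, by_gene$genes)
lookup[["geneA"]]
```

Note the double brackets: `[[` extracts the vector itself, whereas `[` would return a list of length one.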

Unfortunately, R doesn’t make list columns very convenient, so the result can be awkward to handle. For example, if you want to output the data, it might be more convenient to merge the pathways into a single character string per gene:

sub_path %>%
    group_by(genes) %>%
    summarise(pathways = paste(pathways, collapse = ', '))
#   genes           pathways
#   <chr>           <chr>
# 1 ENSG00000000419 REACTOME_DISEASES_ASSOCIATED_WITH_GLYCOSYLATION_PRECURSOR_BIOSYNTHESIS…
# 2 ENSG00000000938 REACTOME_FCGAMMA_RECEPTOR_FCGR_DEPENDENT_PHAGOCYTOSIS, REACTOME_FCGR_A…

What’s more convenient depends on what you need to do with the data afterwards.
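
If a table is not needed at all, base R can produce both shapes without dplyr. The following is only a sketch on toy data: `toy`, `geneA`, `geneB`, `PATHWAY_1`, `PATHWAY_2`, `pathways_by_gene` and `collapsed` are hypothetical names chosen for illustration.

```r
# Toy data (hypothetical names): geneA belongs to two pathways, geneB to one.
toy <- data.frame(
  pathways = c("PATHWAY_1", "PATHWAY_2", "PATHWAY_2"),
  genes    = c("geneA", "geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Named list with one character vector of pathways per gene.
pathways_by_gene <- split(toy$pathways, toy$genes)
pathways_by_gene[["geneA"]]

# Named character vector with one comma-separated string per gene.
collapsed <- vapply(pathways_by_gene, paste, character(1), collapse = ", ")
collapsed[["geneA"]]
```

The named list is probably closest to the "vector per gene" asked for in the question, while the collapsed form is easier to print or write to a file.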