Home > Net >  Find common rows across multiple, but not all available data frames, for all possible combinations o
Find common rows across multiple, but not all available data frames, for all possible combinations o

Time:03-31

I have multiple data frames with the following format:

        Gene Entrez.Id                                    Dataset Correlation
1     MTHFD2     10797 CRISPR (DepMap 22Q1 Public Score, Chronos)   0.3328479
2   SLC25A32     81034 CRISPR (DepMap 22Q1 Public Score, Chronos)   0.3111028
3    MTHFD1L     25902 CRISPR (DepMap 22Q1 Public Score, Chronos)   0.2710356
4       DTX3    196403 CRISPR (DepMap 22Q1 Public Score, Chronos)   0.2672314

My aim was to find elements in the Gene column that were common to all data frames, for which I used the following command:

df.join <- join_all(list(df1,df2,df3,df4,df5), by = "Gene", type = "inner")

But there are actually no Gene elements that are common to all data frames, so df.join is empty. Now I want to know whether there are elements in the Gene column that are common to most data frame but not all, let's say 4 out of 5. Is there a way to do this without manually constructing lines of code for all the possible combinations of data frames?

CodePudding user response:

One option involving dplyr and purrr could be:

ids_to_join <- mget(ls(pattern = "df")) %>%
    map_dfr(~ select(., "Gene"), .id = "dataset") %>%
    group_by(Gene) %>%
    summarise(n = n_distinct(dataset)) %>%
    ungroup() %>%
    filter(n == 5) %>% #The number corresponds to the required number of datasets
    pull(Gene)

mget(ls(pattern = "df")) %>%
    map(~ filter(., Gene %in% ids_to_join)) %>%
    reduce(inner_join, 
           by = "Gene")

In this approach, the IDs that are present in the required number of datasets (here n = 5) are identified. Then, in the second step, these IDs are filtered out and joined together.

If also the information on datasets is needed:

ids_to_join <- mget(ls(pattern = "df")) %>%
    map_dfr(~ select(., "Gene"), .id = "dataset") %>%
    group_by(Gene) %>%
    summarise(n = n_distinct(dataset),
              dataset = paste(dataset, collapse = ", ")) %>%
    ungroup() %>%
    filter(n == 5) %>%
    select(-n)

mget(ls(pattern = "df")) %>%
    map(~ filter(., Gene %in% ids_to_join[["Gene"]])) %>%
    reduce(inner_join, 
           by = "Gene") %>%
    left_join(ids_to_join,
              by = "Gene")
  • Related