I have multiple data frames with the following format:
Gene Entrez.Id Dataset Correlation
1 MTHFD2 10797 CRISPR (DepMap 22Q1 Public Score, Chronos) 0.3328479
2 SLC25A32 81034 CRISPR (DepMap 22Q1 Public Score, Chronos) 0.3111028
3 MTHFD1L 25902 CRISPR (DepMap 22Q1 Public Score, Chronos) 0.2710356
4 DTX3 196403 CRISPR (DepMap 22Q1 Public Score, Chronos) 0.2672314
My aim was to find elements in the Gene
column that were common to all data frames, for which I used the following command:
df.join <- join_all(list(df1,df2,df3,df4,df5), by = "Gene", type = "inner")
But there are actually no Gene
elements that are common to all data frames, so df.join
is empty.
Now I want to know whether there are elements in the Gene
column that are common to most data frame but not all, let's say 4 out of 5. Is there a way to do this without manually constructing lines of code for all the possible combinations of data frames?
CodePudding user response:
One option involving dplyr
and purrr
could be:
ids_to_join <- mget(ls(pattern = "df")) %>%
map_dfr(~ select(., "Gene"), .id = "dataset") %>%
group_by(Gene) %>%
summarise(n = n_distinct(dataset)) %>%
ungroup() %>%
filter(n == 5) %>% #The number corresponds to the required number of datasets
pull(Gene)
mget(ls(pattern = "df")) %>%
map(~ filter(., Gene %in% ids_to_join)) %>%
reduce(inner_join,
by = "Gene")
In this approach, the IDs that are present in the required number of datasets (here n = 5) are identified. Then, in the second step, these IDs are filtered out and joined together.
If also the information on datasets is needed:
ids_to_join <- mget(ls(pattern = "df")) %>%
map_dfr(~ select(., "Gene"), .id = "dataset") %>%
group_by(Gene) %>%
summarise(n = n_distinct(dataset),
dataset = paste(dataset, collapse = ", ")) %>%
ungroup() %>%
filter(n == 5) %>%
select(-n)
mget(ls(pattern = "df")) %>%
map(~ filter(., Gene %in% ids_to_join[["Gene"]])) %>%
reduce(inner_join,
by = "Gene") %>%
left_join(ids_to_join,
by = "Gene")