This was a previous question I asked I nearly have the solution provided by stack community
My dummy dataframe
df1 = data.frame(genes = c('gene1', 'gene3', 'gene4', 'gene2','gene11'))
df2 = data.frame(genes = c('gene3', 'gene2', 'gene5', 'gene1', "genet",'gene11'))
df3 = data.frame(genes = c('gene6', 'gene3', 'gene4', 'gdene7','gene11', 'genex', "gene10"))
dfList <- list(df1, df2, df3)
To see if my set of genes are present in each dataframe or not I ran this
dfList %>%
bind_rows(.id="df") %>%
pivot_wider(names_from=df, names_prefix="df", values_from=df)
Which gives something like this
genes df1 df2 df3
<chr> <chr> <chr> <chr>
gene1 1 2 NA
gene3 1 2 3
gene4 1 NA 3
gene2 1 2 NA
gene11 1 2 3
gene5 NA 2 NA
genet NA 2 NA
gene6 NA NA 3
gdene7 NA NA 3
genex NA NA 3
gene10 NA NA 3
In the above dataframe I can see for each gene, whether or not that gene is in df1
, df2
, or df3
. The problem I have is to find genes based on their presence and absence in different columns/condition
The below code as suggested to do what Im trying to do.
dfList %>%
bind_rows(.id="df") %>%
group_by(genes) %>%
summarise(minDF=min(df), maxDF=max(df)) %>%
filter(minDF == maxDF & maxDF == 3) %>%
pull(genes)
'gdene7''gene10''gene6''genex'
It identifies genes, in this example, that are only present in df3
. How do I identify genes that are, for example:
- present in all three frames?
- present in two specified frames, say
df1
anddf2
- present only in one frame, say
df2
?
Any help or suggestion would be really appreciated
CodePudding user response:
There are a couple of options here
- You can modify your
pivot_longer()
call to include thevalue_fn
param to make logical vectors, and then userowwise()
, and useany
orall
. For example, if you want to get the genes that are in df1, df2, and df3, you can do this:
dfList %>%
bind_rows(.id="df") %>%
pivot_wider(names_from=df, names_prefix="df", values_from=df,values_fn = \(x) !is.na(x)) %>%
rowwise() %>%
filter(all(df1,df2,df3)) %>%
pull(genes)
[1] "gene3" "gene11"
If you change the filter(all())
line to filter(all(df1,df2))
, this will give you the genes in both df1 and df2, without regard to df3
[1] "gene1" "gene3" "gene2" "gene11"
- However, you don't need these intermediate steps. Instead, you can use this flexible function below, called
find_genes()
, which takes a functionf
, such asintersect
,union
, orsetdiff
, and any number of dfs, and returns the set of genes across those dfs according to the set-function requested
find_genes <- function(f,...) {
Reduce(f,c(...))
}
Now, if you want to know the genes only in df3, and not in any others, do this:
find_genes(setdiff,df3,df2,df1)
Output:
[1] "gene6" "gdene7" "genex" "gene10"
If you want to know genes that are present in both df1 and df2, pass intersect
to the find_genes()
function, like this:
find_genes(intersect, df1,df2)
Output:
[1] "gene1" "gene3" "gene2" "gene11"
If you want to know genes that are present in all three, pass all three frames
find_genes(intersect, df3, df2, df1)
Output:
[1] "gene3" "gene11"
If you want the union of genes in two frames, pass union
:
find_genes(union, df1, df3)
Output:
"gene1" "gene3" "gene4" "gene2" "gene11" "gene6" "gdene7" "genex" "gene10"