I have a data frame in R of around 13,000 genes and their InterPro domains.
I want to compute a gene-by-gene matrix in which each pairwise value is the number of InterPro domains the two genes have in common.
For example:
Gene   Interpro_domain_1  Interpro_domain_2  Interpro_domain_3
--------------------------------------------------------------
Gene1  IPR000008          IPR001202          IPR035892
Gene2  IPR000008          IPR016024          NA
Gene3  IPR000664          IPR001202          IPR011009
Gene4  IPR001544          NA                 NA
This would become a matrix that looks like:
       | Gene1  Gene2  Gene3  Gene4
-------|----------------------------
Gene1  |   3      1      1      0
Gene2  |   1      2      0      0
Gene3  |   1      0      3      0
Gene4  |   0      0      0      1
etc...
I want to do this for up to 20 domains.
I also have this data frame in list format if that is easier to work with.
Thank you.
CodePudding user response:
A possible way of doing this:
library(tidyverse)
df <- tribble(
  ~Gene,   ~Interpro_domain_1, ~Interpro_domain_2, ~Interpro_domain_3,
  "Gene1", "IPR000008",        "IPR001202",        "IPR035892",
  "Gene2", "IPR000008",        "IPR016024",        NA,
  "Gene3", "IPR000664",        "IPR001202",        "IPR011009",
  "Gene4", "IPR001544",        NA,                 NA
)
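df |>
  # one row per (gene, domain) pair, dropping missing domains
  pivot_longer(-Gene) |>
  select(-name) |>
  filter(!is.na(value)) |>
  # one row per domain, one column per gene
  pivot_wider(names_from = Gene, values_from = Gene) |>
  select(-value) |>
  # list the genes that share each domain, e.g. "Gene1_Gene2"
  unite("Gene", everything(), na.rm = TRUE, remove = FALSE) |>
  # turn the per-gene columns into 0/1 indicators
  mutate(across(-Gene, ~if_else(!is.na(.), 1, 0))) |>
  # split the united column back into one row per gene; the default
  # separator splits on any non-alphanumeric character except ".",
  # so gene names containing "-" or "_" would be split as well
  separate_rows(Gene) |>
  group_by(Gene) |>
  summarise(across(everything(), sum))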
#> # A tibble: 4 × 5
#>   Gene  Gene1 Gene2 Gene3 Gene4
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gene1     3     1     1     0
#> 2 Gene2     1     2     0     0
#> 3 Gene3     1     0     3     0
#> 4 Gene4     0     0     0     1
Created on 2022-05-18 by the reprex package (v2.0.1)
CodePudding user response:
A base R alternative, using the same df:
crossprod(table(cbind(unlist(df[-1]), df[1])))
       Gene
Gene    Gene1 Gene2 Gene3 Gene4
  Gene1     3     1     1     0
  Gene2     1     2     0     0
  Gene3     1     0     3     0
  Gene4     0     0     0     1
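To see why this one-liner gives the shared-domain counts, it may help to unpack it into steps. The sketch below is not the answerer's code: it rebuilds the toy data as a base data.frame, and the names long, incidence and shared are introduced here purely for illustration. It also assumes each gene lists a given domain at most once; a repeated domain would inflate the counts.

```r
# Toy data from the question, rebuilt as a base data.frame
df <- data.frame(
  Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
  Interpro_domain_1 = c("IPR000008", "IPR000008", "IPR000664", "IPR001544"),
  Interpro_domain_2 = c("IPR001202", "IPR016024", "IPR001202", NA),
  Interpro_domain_3 = c("IPR035892", NA, "IPR011009", NA)
)

# Long format: one row per (domain, gene) pair.
# unlist() walks the domain columns one after another, so the gene
# names are repeated once per domain column to stay aligned.
long <- data.frame(
  domain = unlist(df[-1], use.names = FALSE),
  Gene   = rep(df$Gene, times = ncol(df) - 1)
)

# Domain-by-gene incidence table; table() drops NA values by default
incidence <- table(long)

# crossprod(x) is t(x) %*% x: entry [i, j] sums, over all domains,
# the product of the indicators for gene i and gene j
shared <- crossprod(incidence)
```

The diagonal of shared holds each gene's own domain count, which is why Gene1 shows 3 and Gene4 shows 1.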
If you need the result as a data.frame:
a <- crossprod(table(cbind(unlist(df[-1]), df[1])))
as.data.frame.matrix(a)
      Gene1 Gene2 Gene3 Gene4
Gene1     3     1     1     0
Gene2     1     2     0     0
Gene3     1     0     3     0
Gene4     0     0     0     1
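The question mentions that the data also exist in list format. The exact shape of that list is not shown, so the following is only a sketch under an assumption: a named list with one character vector of domains per gene. The names domains_by_gene and shared_from_list are hypothetical.

```r
# ASSUMED layout (not shown in the question): a named list with
# one character vector of InterPro domains per gene
domains_by_gene <- list(
  Gene1 = c("IPR000008", "IPR001202", "IPR035892"),
  Gene2 = c("IPR000008", "IPR016024"),
  Gene3 = c("IPR000664", "IPR001202", "IPR011009"),
  Gene4 = "IPR001544"
)

# Drop repeated domains within a gene so a shared domain counts once
domains_by_gene <- lapply(domains_by_gene, unique)

# stack() returns a two-column data frame:
# values (the domain) and ind (the gene name, as a factor)
long <- stack(domains_by_gene)

# Domain-by-gene incidence table, then gene-by-gene shared counts
shared_from_list <- crossprod(table(long$values, long$ind))
```

With this layout there is no need to pad genes with NA, since each vector can have its own length.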