Counting number of pairwise matches between genes in R

Time:05-19

I have a data frame in R of around 13,000 genes and their InterPro domains.

I want to compute a gene-by-gene matrix in which each entry is the number of InterPro domains that the two genes have in common.

For example:

Gene    Interpro_domain_1    Interpro_domain_2    Interpro_domain_3
--------------------------------------------------------------------
Gene1   IPR000008            IPR001202            IPR035892
Gene2   IPR000008            IPR016024            NA
Gene3   IPR000664            IPR001202            IPR011009
Gene4   IPR001544            NA                   NA

This would become a matrix that looks like:

       |Gene1    Gene2    Gene3    Gene4
-------|---------------------------------
Gene1  |  3        1        1        0
Gene2  |  1        2        0        0
Gene3  |  1        0        3        0
Gene4  |  0        0        0        1

etc...

I want to do this for up to 20 domain columns.

I also have this data frame in list format if that is easier to work with.

Thank you.

CodePudding user response:

A possible way of doing this:

library(tidyverse)

df <- tribble(
  ~Gene, ~Interpro_domain_1, ~Interpro_domain_2, ~Interpro_domain_3,
  "Gene1", "IPR000008", "IPR001202", "IPR035892",
  "Gene2", "IPR000008", "IPR016024", NA,
  "Gene3", "IPR000664", "IPR001202", "IPR011009",
  "Gene4", "IPR001544", NA, NA
) 

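df |>
  # One row per (gene, domain) pair, dropping empty domain slots
  pivot_longer(-Gene) |>
  select(-name) |>
  filter(!is.na(value)) |>
  # One row per domain, one column per gene (gene name if present, else NA)
  pivot_wider(names_from = Gene, values_from = Gene) |>
  select(-value) |>
  # Record which genes share each domain, e.g. "Gene1_Gene2"
  unite("Gene", everything(), na.rm = TRUE, remove = FALSE) |>
  # Turn the gene columns into 0/1 indicators
  mutate(across(-Gene, ~if_else(!is.na(.), 1, 0))) |>
  # One row per gene within each domain, then sum the indicators per gene
  separate_rows(Gene) |>
  group_by(Gene) |>
  summarise(across(everything(), sum))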
#> # A tibble: 4 × 5
#>   Gene  Gene1 Gene2 Gene3 Gene4
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gene1     3     1     1     0
#> 2 Gene2     1     2     0     0
#> 3 Gene3     1     0     3     0
#> 4 Gene4     0     0     0     1

Created on 2022-05-18 by the reprex package (v2.0.1)

CodePudding user response:

A base R option, using the df defined above:

crossprod(table(cbind(unlist(df[-1]), df[1])))
       Gene
Gene    Gene1 Gene2 Gene3 Gene4
  Gene1     3     1     1     0
  Gene2     1     2     0     0
  Gene3     1     0     3     0
  Gene4     0     0     0     1
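
The one-liner above can be unpacked into separate steps, which may make it easier to adapt to 20 domain columns. This is only a sketch under some assumptions: the names long, inc and shared are made up for illustration, and the example data is rebuilt as a plain data.frame so the block runs on its own.

```r
# Example data from the question (base R data frame, no packages needed)
df <- data.frame(
  Gene = c("Gene1", "Gene2", "Gene3", "Gene4"),
  Interpro_domain_1 = c("IPR000008", "IPR000008", "IPR000664", "IPR001544"),
  Interpro_domain_2 = c("IPR001202", "IPR016024", "IPR001202", NA),
  Interpro_domain_3 = c("IPR035892", NA, "IPR011009", NA)
)

# Long format: one row per (domain, gene) pair.
# unlist() stacks the domain columns one after another,
# so the gene column is repeated once per domain column.
long <- data.frame(
  domain = unlist(df[-1], use.names = FALSE),
  gene   = rep(df$Gene, times = ncol(df) - 1)
)

# Domain-by-gene contingency table; table() drops NA by default
inc <- table(long$domain, long$gene)

# crossprod(inc) is t(inc) %*% inc:
# entry [i, j] counts the domains that gene i and gene j both have
shared <- crossprod(inc)
shared
```

The diagonal holds each gene's own domain count, which is why Gene1 shows 3 against itself. Nothing here depends on the number of domain columns, so the same steps should carry over to a wider data frame.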

If you need the result as a data.frame:

a <- crossprod(table(cbind(unlist(df[-1]), df[1])))
as.data.frame.matrix(a)
      Gene1 Gene2 Gene3 Gene4
Gene1     3     1     1     0
Gene2     1     2     0     0
Gene3     1     0     3     0
Gene4     0     0     0     1
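
The question mentions that the data also exists in list format. Assuming that means a named list with one character vector of domains per gene (the name domains_by_gene is hypothetical), a similar sketch could start from stack(). It also caps the table at 1, in case a gene lists the same domain more than once, which would otherwise inflate the counts.

```r
# Hypothetical list layout: one character vector of domains per gene
domains_by_gene <- list(
  Gene1 = c("IPR000008", "IPR001202", "IPR035892"),
  Gene2 = c("IPR000008", "IPR016024"),
  Gene3 = c("IPR000664", "IPR001202", "IPR011009"),
  Gene4 = "IPR001544"
)

# stack() gives two columns: values (the domain) and ind (the gene name)
long <- stack(domains_by_gene)

# Domain-by-gene presence/absence matrix; TRUE where the gene has the domain,
# so a domain repeated within one gene is still counted once
inc <- unclass(table(long$values, long$ind)) > 0

# Cross-product of the 0/1 matrix gives the shared-domain counts
shared <- crossprod(inc * 1L)
shared
```

With around 13,000 genes the result is a 13,000 x 13,000 matrix, so it is worth checking memory before running this on the full data; if most gene pairs share nothing, a sparse representation such as the one in the Matrix package may be worth considering.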
Tags: r