Create a table that records the number of row pairs that are not zero in R

Apologies if the title is confusing, but below is what I would like to accomplish. Let's say I have a table as seen below.

df <- data.frame(
  patient = paste0("patient",seq(1:6)),
  gene_1 = c(10,5,0,0,1,0),
  gene_2 = c(0,26,4,5,6,1),
  gene_3 = c(1,3,5,12,44,1)
)

patient	gene_1	gene_2	gene_3
patient1	10	0	1
patient2	5	26	3
patient3	0	4	5
patient4	0	5	12
patient5	1	6	44
patient6	0	1	1

What I want is another table that records the total number of pairs only if both values are non-zero. The table would look like so:

col1	col2	number-of-pairs
gene1	gene2	2
gene1	gene3	3
gene2	gene3	5

Any help is appreciated. Thank you.

CodePudding user response：

We can do this by pivoting your data to a long format, doing a self-join, and then filtering:

library(tidyr)
library(dplyr)
## Long format, keep only non-zeros
long_data = pivot_longer(df, -patient) %>%
  filter(value != 0) %>%
  select(-value)

## Self join on patient,
## then keep only rows where name.x < name.y:
## this drops self-pairs and counts each pair only once
long_data %>%
  left_join(long_data, by = "patient") %>%
  filter(name.x < name.y) %>%
  count(name.x, name.y)
# # A tibble: 3 × 3
#   name.x name.y     n
#   <chr>  <chr>  <int>
# 1 gene_1 gene_2     2
# 2 gene_1 gene_3     3
# 3 gene_2 gene_3     5
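
If you would rather avoid the extra packages, the same self-join idea can be sketched in base R with stack() and merge(). This is only a rough equivalent under the assumption that df is the data frame from the question; the names long, joined and counts are made up for illustration.

```r
df <- data.frame(
  patient = paste0("patient", 1:6),
  gene_1 = c(10, 5, 0, 0, 1, 0),
  gene_2 = c(0, 26, 4, 5, 6, 1),
  gene_3 = c(1, 3, 5, 12, 44, 1)
)

# Long format: one row per patient/gene, keep only non-zero values
long <- stack(df[-1])                                   # columns: values, ind
long$patient <- rep(df$patient, times = ncol(df) - 1)   # stack() goes column by column
long <- long[long$values != 0, c("patient", "ind")]
long$ind <- as.character(long$ind)

# Self join on patient; merge() adds the suffixes .x and .y
joined <- merge(long, long, by = "patient")

# Keep each unordered pair once (this also drops self-pairs)
joined <- joined[joined$ind.x < joined$ind.y, ]

# Count the pairs and drop combinations that never occur
counts <- as.data.frame(table(col1 = joined$ind.x, col2 = joined$ind.y),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts
```

The Freq column holds the number of patients in which both genes are non-zero.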

CodePudding user response：

You can do this in an uninterrupted pipe by using combn:

library(tidyverse) 

df %>%
  pivot_longer(-1) %>%
  filter(value > 0) %>%
  group_by(patient) %>%
  summarize(pairs = apply(combn(name, 2), 2, paste, collapse = ' '),
            .groups = 'drop') %>%
  separate(pairs, sep = ' ', into = c('col1', 'col2')) %>%
  count(col1, col2)
#>  # A tibble: 3 x 3
#>    col1   col2       n
#>    <chr>  <chr>  <int>
#>  1 gene_1 gene_2     2
#>  2 gene_1 gene_3     3
#>  3 gene_2 gene_3     5

CodePudding user response：

You could use a simple for loop that accesses each pair of columns of df, coerces each column into a logical vector with > 0, and then uses the & operator to find the positions that are > 0 in both. In case you did not know, calling sum on a logical vector counts how many TRUE values it contains.

df <- data.frame(
  patient = paste0("patient",seq(1:6)),
  gene_1 = c(10,5,0,0,1,0),
  gene_2 = c(0,26,4,5,6,1),
  gene_3 = c(1,3,5,12,44,1)
)
gene_cols <- setdiff(colnames(df), "patient")
# Generate all the combinations
out <- as.data.frame(t(combn(gene_cols, 2)))
pairs <- vector("integer", nrow(out))
for (i in seq_along(pairs)) {
  pairs[i] <- sum(df[[out$V1[i]]]>0 & df[[out$V2[i]]]>0)
}
out$n_pairs <- pairs
out
#>       V1     V2 n_pairs
#> 1 gene_1 gene_2       2
#> 2 gene_1 gene_3       3
#> 3 gene_2 gene_3       5

Created on 2022-04-07 by the reprex package (v2.0.1)
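
The loop above counts the TRUE values for one pair of columns at a time. As a sketch of a different technique (a matrix product rather than a loop), crossprod() on the 0/1 indicator matrix should give the counts for all pairs at once. It assumes df is the data frame from the question; nonzero, co_counts and idx are illustrative names.

```r
df <- data.frame(
  patient = paste0("patient", 1:6),
  gene_1 = c(10, 5, 0, 0, 1, 0),
  gene_2 = c(0, 26, 4, 5, 6, 1),
  gene_3 = c(1, 3, 5, 12, 44, 1)
)

# 1 where a value is non-zero, 0 otherwise (dimnames are kept)
nonzero <- (as.matrix(df[-1]) != 0) * 1

# t(nonzero) %*% nonzero: entry [i, j] is the number of rows
# in which gene i and gene j are both non-zero
co_counts <- crossprod(nonzero)

# Keep each pair once by reading only the upper triangle
idx <- which(upper.tri(co_counts), arr.ind = TRUE)
out <- data.frame(
  col1 = rownames(co_counts)[idx[, 1]],
  col2 = colnames(co_counts)[idx[, 2]],
  n_pairs = co_counts[idx]
)
out
```

The diagonal of co_counts is the number of non-zero values per gene, which is why only the upper triangle is used for the pair table.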

CodePudding user response：

A one-liner base R way:

table(unlist(apply(df[-1], 1, \(x) combn(names(x)[x != 0], m = 2, toString))))

# gene_1, gene_2 gene_1, gene_3 gene_2, gene_3 
#              2              3              5

You could use this to get the expected output:

library(dplyr)
library(tidyr)

tibble(col = unlist(apply(df[-1], 1, \(x) combn(names(x)[x != 0], m = 2, toString)))) %>% 
  separate(col, into = c("col1", "col2"), sep = ", ") %>% 
  count(col1, col2)

# A tibble: 3 x 3
  col1   col2       n
  <chr>  <chr>  <int>
1 gene_1 gene_2     2
2 gene_1 gene_3     3
3 gene_2 gene_3     5
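
One caveat: combn() stops with an error when a row has fewer than two non-zero genes. That never happens in the example data, but it might in real data. Below is a minimal sketch of a guarded version; df_sparse and pair_labels are made-up names used only for this illustration.

```r
# Hypothetical data: patient2 has one non-zero gene, patient3 has none
df_sparse <- data.frame(
  patient = paste0("patient", 1:4),
  gene_1 = c(10, 0, 0, 2),
  gene_2 = c(0, 0, 0, 4),
  gene_3 = c(1, 3, 0, 5)
)

# Return the pair labels for one row, or nothing if no pair can be formed
pair_labels <- function(x) {
  nz <- names(x)[x != 0]
  if (length(nz) < 2) return(character(0))
  combn(nz, 2, toString)
}

genes <- as.matrix(df_sparse[-1])
labels <- unlist(lapply(seq_len(nrow(genes)), function(i) pair_labels(genes[i, ])))
pair_counts <- table(labels)
pair_counts
```

Note that a pair which never occurs together is simply missing from the table rather than listed with a count of 0.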

CodePudding user response：

Here's another base R approach. Although it doesn't look elegant, it is actually the most efficient answer so far.

First create a combn_gene matrix that contains the gene pairs (one pair per row). Then use sapply to go through each pair and check, for every row of df, whether the sum of the two columns equals either of the original values: if one of the two values is 0, the sum is the same as the other value. Finally, count the rows where the sum differs from both values, i.e. where both columns are non-zero.

combn_gene <- t(combn(colnames(df)[-1], 2))

cbind(setNames(as.data.frame(combn_gene), c("col1", "col2")), 
      "number-of-pairs" = sapply(1:nrow(combn_gene), function(x) 
        colSums(
          !(
            (df[combn_gene[x, 1]] == df[combn_gene[x, 1]] + df[combn_gene[x, 2]]) | 
              (df[combn_gene[x, 2]] == df[combn_gene[x, 1]] + df[combn_gene[x, 2]])
            )
          ))
      )

    col1   col2 number-of-pairs
1 gene_1 gene_2               2
2 gene_1 gene_3               3
3 gene_2 gene_3               5
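
To see why the sum trick works, here it is worked through on a single pair, using the gene_1 and gene_2 values from the question. The vector names a, b and s are just for illustration.

```r
a <- c(10, 5, 0, 0, 1, 0)   # gene_1
b <- c(0, 26, 4, 5, 6, 1)   # gene_2
s <- a + b                  # 10 31 4 5 7 1

# If a is 0 then s equals b; if b is 0 then s equals a.
# Only when both are non-zero does s differ from both.
both_nonzero <- !((a == s) | (b == s))
both_nonzero
sum(both_nonzero)

# The direct test gives the same logical vector
identical(both_nonzero, a != 0 & b != 0)
```

The trick relies on exact equality, so for non-integer data the direct a != 0 & b != 0 test is probably the safer choice.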

CodePudding user response：

This gives the result that you need, but I am not sure it is a good fit for your case because the process is manual.

library(dplyr)

gene1_gene2 = df %>% filter(gene_1 != 0 & gene_2 != 0) %>% count() %>% rename(number_of_pairs = n)

gene1_gene3 = df %>% filter(gene_1 != 0 & gene_3 !=0) %>% count() %>% rename(number_of_pairs = n)

gene2_gene3 = df %>% filter(gene_2 != 0 & gene_3 !=0) %>% count() %>% rename(number_of_pairs = n)
number_of_pairs = rbind(gene1_gene2, gene1_gene3, gene2_gene3)

new_df = data.frame(
  col1 = c("gene1", "gene1", "gene2"),
  col2 = c("gene2", "gene3", "gene3"))

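# Take the count column itself, so new_df gets a plain integer column
# instead of a nested data frame
new_df$number_of_pairs = number_of_pairs$number_of_pairs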

new_df
  col1  col2 number_of_pairs
1 gene1 gene2               2
2 gene1 gene3               3
3 gene2 gene3               5

CodePudding user response：

Another tidyverse approach might be:

library(dplyr)
library(purrr)

map_dfr(.x = combn(names(select(df, starts_with("gene"))), 2, simplify = FALSE),
        ~ df %>%
            summarise(col1 = first(.x),
                      col2 = last(.x),
                      number = sum(rowSums(across(all_of(.x)) != 0) == 2)))

    col1   col2 number
1 gene_1 gene_2      2
2 gene_1 gene_3      3
3 gene_2 gene_3      5