Apologies if the title is confusing, but below is what I would like to accomplish. Let's say I have a table as seen below.
df <- data.frame(
patient = paste0("patient",seq(1:6)),
gene_1 = c(10,5,0,0,1,0),
gene_2 = c(0,26,4,5,6,1),
gene_3 = c(1,3,5,12,44,1)
)
patient | gene_1 | gene_2 | gene_3 |
---|---|---|---|
patient1 | 10 | 0 | 1 |
patient2 | 5 | 26 | 3 |
patient3 | 0 | 4 | 5 |
patient4 | 0 | 5 | 12 |
patient5 | 1 | 6 | 44 |
patient6 | 0 | 1 | 1 |
What I want is another table that counts, for each pair of genes, the number of patients in which both values are non-zero. The table would look like so:
col1 | col2 | number-of-pairs |
---|---|---|
gene1 | gene2 | 2 |
gene1 | gene3 | 3 |
gene2 | gene3 | 5 |
Any help is appreciated. Thank you.
CodePudding user response:
We can do this by pivoting your data to a long format, doing a self-join, and then filtering:
library(tidyr)
library(dplyr)
## Long format, keep only non-zeros
long_data = pivot_longer(df, -patient) %>%
filter(value != 0) %>%
select(-value)
## Self join on patient,
## then keep only name.x < name.y: this drops self-pairs
## (can't pair with yourself) and mirrored double counts
long_data %>%
left_join(long_data, by = "patient") %>%
filter(name.x < name.y) %>%
count(name.x, name.y)
# # A tibble: 3 × 3
# name.x name.y n
# <chr> <chr> <int>
# 1 gene_1 gene_2 2
# 2 gene_1 gene_3 3
# 3 gene_2 gene_3 5
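To see why the strict < comparison in the filter drops both self-pairs and mirrored duplicates, here is a minimal sketch for a single patient. It uses base R merge in place of left_join (a substitution made only so the snippet needs no packages), and the object names long, joined and kept are illustrative.

```r
# All non-zero genes for one patient (patient2 in the question's data)
long <- data.frame(patient = "patient2",
                   name = c("gene_1", "gene_2", "gene_3"))

# Base R self-join; the duplicated column becomes name.x / name.y
joined <- merge(long, long, by = "patient")
nrow(joined)
#> [1] 9

# Keep each unordered pair once: drops (a, a) and one of (a, b) / (b, a)
kept <- joined[joined$name.x < joined$name.y, ]
nrow(kept)
#> [1] 3
```

Three genes give 3 x 3 = 9 joined rows, of which 3 are self-pairs and 3 are mirrored duplicates, leaving the 3 distinct pairs.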
CodePudding user response:
You can do this in an uninterrupted pipe by using combn:
library(tidyverse)
df %>%
pivot_longer(-1) %>%
filter(value > 0) %>%
group_by(patient) %>%
summarize(pairs = apply(combn(name, 2), 2, paste, collapse = ' '),
.groups = 'drop') %>%
separate(pairs, sep = ' ', into = c('col1', 'col2')) %>%
count(col1, col2)
#> # A tibble: 3 x 3
#> col1 col2 n
#> <chr> <chr> <int>
#> 1 gene_1 gene_2 2
#> 2 gene_1 gene_3 3
#> 3 gene_2 gene_3 5
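One caveat worth noting: combn(x, 2) stops with an error when x has fewer than two elements, which would happen for any patient with fewer than two non-zero genes (no such patient exists in the example data). A possible guard, sketched here with a hypothetical helper name safe_pairs that is not part of the answer above:

```r
# Hypothetical helper: return the gene pairs for one patient,
# or an empty character vector when fewer than two genes are present
safe_pairs <- function(genes) {
  if (length(genes) < 2) return(character(0))
  combn(genes, 2, paste, collapse = " ")
}

safe_pairs("gene_1")
#> character(0)
safe_pairs(c("gene_1", "gene_2", "gene_3"))
#> [1] "gene_1 gene_2" "gene_1 gene_3" "gene_2 gene_3"
```

Something like this could stand in for the apply(combn(name, 2), ...) expression if such patients are expected in real data.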
CodePudding user response:
You could use a simple for loop that takes each pair of columns of df, coerces each column into a logical vector with > 0, and then uses the & operator to find the positions that are > 0 in both. In case you didn't know, you can use sum on a logical vector to count how many TRUE values there are.
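As a small illustration of that last point, here is the comparison on two short vectors (the first three values of gene_1 and gene_2 from the question; the names g1, g2, both and n_both are only for this sketch):

```r
g1 <- c(10, 5, 0)
g2 <- c(0, 26, 4)

# Element-wise: TRUE only where both values are greater than zero
both <- g1 > 0 & g2 > 0
both
#> [1] FALSE  TRUE FALSE

# sum() treats TRUE as 1 and FALSE as 0
n_both <- sum(both)
n_both
#> [1] 1
```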
df <- data.frame(
patient = paste0("patient",seq(1:6)),
gene_1 = c(10,5,0,0,1,0),
gene_2 = c(0,26,4,5,6,1),
gene_3 = c(1,3,5,12,44,1)
)
gene_cols <- setdiff(colnames(df), "patient")
# Generate all the combinations
out <- as.data.frame(t(combn(gene_cols, 2)))
pairs <- vector("integer", nrow(out))
for (i in seq_len(length(pairs))) {
pairs[i] <- sum(df[[out$V1[i]]]>0 & df[[out$V2[i]]]>0)
}
out$n_pairs <- pairs
out
#> V1 V2 n_pairs
#> 1 gene_1 gene_2 2
#> 2 gene_1 gene_3 3
#> 3 gene_2 gene_3 5
Created on 2022-04-07 by the reprex package (v2.0.1)
CodePudding user response:
A one-liner base R way:
table(unlist(apply(df[-1], 1, \(x) combn(names(x)[x != 0], m = 2, toString))))
# gene_1, gene_2 gene_1, gene_3 gene_2, gene_3
# 2 3 5
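To unpack the one-liner, it may help to look at a single row. apply passes each row to the function as a named numeric vector, so the sketch below builds row 2 (patient2) by hand; the names nonzero and row_pairs are illustrative.

```r
# Row 2 of the question's data (patient2), as a named vector
x <- c(gene_1 = 5, gene_2 = 26, gene_3 = 3)

# Names of the genes with a non-zero value in this row
nonzero <- names(x)[x != 0]

# Every pair of those names, each collapsed to one string by toString
row_pairs <- combn(nonzero, m = 2, toString)
row_pairs
#> [1] "gene_1, gene_2" "gene_1, gene_3" "gene_2, gene_3"
```

unlist then gathers these strings across all rows, and table counts how often each pair occurs.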
You could use this to get the expected output:
library(tidyverse)
tibble(col = unlist(apply(df[-1], 1, \(x) combn(names(x)[x != 0], m = 2, toString)))) %>%
separate(col, into = c("col1", "col2"), sep = ", ") %>%
count(col1, col2)
# A tibble: 3 x 3
col1 col2 n
<chr> <chr> <int>
1 gene_1 gene_2 2
2 gene_1 gene_3 3
3 gene_2 gene_3 5
CodePudding user response:
Here's another base R approach. It doesn't look elegant, but it's actually the most efficient answer so far.
First create a combn_gene matrix that contains the gene pairs. Then use sapply to go through every pair and check, row by row, whether the sum of the two genes equals either gene's original value (if one of the values is 0, the sum is the same as the other value). Finally, count the rows where the sum differs from both original values, i.e. where both genes are non-zero.
combn_gene <- t(combn(colnames(df)[-1], 2))
cbind(setNames(as.data.frame(combn_gene), c("col1", "col2")),
"number-of-pairs" = sapply(1:nrow(combn_gene), function(x)
colSums(
!(
(df[combn_gene[x, 1]] == df[combn_gene[x, 1]] + df[combn_gene[x, 2]]) |
(df[combn_gene[x, 2]] == df[combn_gene[x, 1]] + df[combn_gene[x, 2]])
)
))
)
col1 col2 number-of-pairs
1 gene_1 gene_2 2
2 gene_1 gene_3 3
3 gene_2 gene_3 5
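The sum-comparison idea can be checked on two plain vectors. This is only a sketch of the logic, not the answer's exact code; it assumes values where adding a non-zero number always changes the result, as with the small counts here. The names a, b, s, has_zero and n_pairs are illustrative.

```r
a <- c(10, 5, 0, 0, 1, 0)   # gene_1 from the question
b <- c(0, 26, 4, 5, 6, 1)   # gene_2 from the question

s <- a + b
# The sum equals one of the inputs exactly when the other input is 0
has_zero <- (a == s) | (b == s)

# Rows where neither value is 0
n_pairs <- sum(!has_zero)
n_pairs
#> [1] 2
```

This gives the same count as the more direct sum(a != 0 & b != 0).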
CodePudding user response:
This gives the result you need, but I am not sure it suits your case, because the process is manual: every gene pair has to be written out by hand.
library(dplyr)
gene1_gene2 = df %>% filter(gene_1 != 0 & gene_2 != 0) %>% count() %>% rename(number_of_pairs = n)
gene1_gene3 = df %>% filter(gene_1 != 0 & gene_3 != 0) %>% count() %>% rename(number_of_pairs = n)
gene2_gene3 = df %>% filter(gene_2 != 0 & gene_3 != 0) %>% count() %>% rename(number_of_pairs = n)
number_of_pairs = rbind(gene1_gene2, gene1_gene3, gene2_gene3)
new_df = data.frame(
col1 = c("gene1", "gene1", "gene2"),
col2 = c("gene2", "gene3", "gene3"))
## Assign the count column itself, not the whole one-column data frame
new_df$number_of_pairs = number_of_pairs$number_of_pairs
new_df
col1 col2 number_of_pairs
1 gene1 gene2 2
2 gene1 gene3 3
3 gene2 gene3 5
CodePudding user response:
Another tidyverse approach might be:
library(tidyverse)
map_dfr(.x = combn(names(select(df, starts_with("gene"))), 2, simplify = FALSE),
~ df %>%
summarise(col1 = first(.x),
col2 = last(.x),
number = sum(rowSums(across(all_of(.x)) != 0) == 2)))
col1 col2 number
1 gene_1 gene_2 2
2 gene_1 gene_3 3
3 gene_2 gene_3 5