How can I generate every combination of groups from a list of pairs?-CodePudding

I successfully used pandas to create a Pearson correlation matrix. From this matrix, I was able to extract many pairs of genes that correlated more than a certain threshold (0.75 in this instance) and then saved each pair as a tuple within a large list.

How do I now go about through this list to generate every possible correlated gene combination that's more than just pairs? For example:

Lets say there are 4 genes: A, B, C, and D. Each gene is correlated with the other (and itself obviously). This means somewhere in the big list there are the following 10 separate tuple pairs [(A,A), (B,B), (C,C), (D,D), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)]. However, from this you could also now create the tuples (A, B, C) and (A, B, C, D) and (B, C, D) and (A, C, D) and (A, B, D) and so on since they all correlate with each other. How do I go about writing a function that can create every combination of these new groups using just the pairs I currently have saved in my list?

By the way, if gene A correlated with gene B, and gene B correlated with gene C, but gene A did not correlate with gene C, they would NOT form the group (A, B, C). Everything has to correlate with each other to form a group.

The number of tuple pairs is in the millions, so finding an efficient way to get this done would be ideal (but not essential).

I am honestly not sure where I can begin. I was recommended to use a function that would give me all subsets of an array but that doesn't really help with the overall issue. Here are some possible steps I thought of that would technically get me what I want but would be extremely inefficient.

I make a simple list of every single gene there is from all the pairs.
I run the command to generate every possible subset of this list of genes.
I then comb through every single generated subset and using the paired list check that everything within the subset correlates with each other. If it doesn't, toss that subset out.
The remaining non tossed out subsets are my answers.

Sample Input: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E)]

Sample Output: [(A,A), (B,B), (C,C), (D,D), (E,E), (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (C,E), (A,B,C,D), (A,B,C), (B,C,D), (A,C,D), (A,B,D)]

Note how E isn't found in any of the new combinations since it only correlates with itself and C so it can't be included in any other group.

CodePudding user response：

First, this is actually an interesting question once I understood what you were after. This can be done, but my intuition tells me there is probably a much better, more efficient way than what I have here. This uses only itertools.combinations and list comprehensions and a set (just to prevent duplicates).

from itertools import combinations

genes = ["A", "B", "C", "D"]

# We assume each pair is lexicographical.
# If not, they must be sorted.
pairs = [
    ("A", "A"),
    ("B", "B"),
    ("C", "C"),
    ("D", "D"),
    ("E", "E"),
    ("A", "B"),
    ("A", "C"),
    ("A", "D"),
    ("B", "C"),
    ("B", "D"),
    ("C", "D"),
    ("C", "E"),
]

# List of lengths of all combinations greater than pairs.
# So, if you have 6 unique genes [A,B,C,D,E,F], this will be [3, 4, 5, 6]
lengths = [l   1 for l in range(len(genes)) if l   1 > 2]

# Combinations that can be built from the pairs.
new_combs = {c for l in lengths for c in combinations(genes, l)}

# This is a list of those combinations, where each pair built for a given
# combination is found in the original pairs list.
# For example, if a new combination is [A,B,C], it is only included in
# correlations if AB, BC, and AC are all found in pairs.
correlations = {x for x in new_combs if all([c in pairs for c in combinations(x, 2)])}
print(list(correlations))

Output

ᐅ ./main.py
[('A', 'B', 'C'), ('A', 'B', 'C', 'D'), ('A', 'B', 'D'), ('B', 'C', 'D'), ('A', 'C', 'D')]

CodePudding user response：

This falls into the class of problems known as clique problems.

Specifically you want to list all maximal cliques of the graph whose vertices are your genes and whose edges are your corellated pairs.

According to wikipedia the most efficient algorithm in pratice is the Bron–Kerbosch algorithm (or its variations) from 1973.