What is the best way to find all pairs in a list with a given condition?-CodePudding

Say I have a list ['ab','cd','ac','de'] and I want to find all possible pairs where both elements do not have any repeating letter.

The answer would be ['ab','cd'],['ab','de'],['ac','de'].

But I do not know the best possible way/algorithm to solve this. Currently what I am doing is using a nested for loop to check with each element and then removing it to shorten the list, but when my list is 1000s of lines long and the elements more complicated, it starts to take a noticeable amount of time.

arr4 = ['ab','cd','ac','de']
for i in arr4:
    for x in arr4:
        if not bool(set(i)&set(x)):
            print('[' i ',' x ']')
    print('----')
    arr4.remove(i)

CodePudding user response：

Your solution isn't bad, as far as the basic algorithm goes, but you shouldn't modify the list you're iterating over like you do. Also, there's some optimisation possible:

from random import sample
from timeit import timeit
from itertools import combinations


size = 3  # length of the strings
chars = [chr(n) for n in range(97, 123)]
strings = {''.join(sample(chars, size)) for _ in range(1000)}


# your solution as a one-liner
def pair_strs1(ps):
    return [(p, q) for p in ps for q in ps if not (set(p) & set(q))]


# this doesn't work, unless there's no initial pairs with double letters like 'aa'
def pair_strs2(ps):
    return [(p, q) for p in ps for q in ps if len(set(p q)) == 4]


# your solution, but compute the sets only once, then look them up before combining
def pair_strs3(ps):
    ps = {p: set(p) for p in ps}
    return [(p, q) for p in ps for q in ps if not (ps[p] & ps[q])]


# same, but avoiding mirrored pairs, only compute above the diagonal of the product
def pair_strs4(ps):
    ps = {p: set(p) for p in ps}
    keys = list(ps.keys())
    return [(p, q) for i, p in enumerate(ps) for q in keys[i 1:] if not (ps[p] & ps[q])]


# the same again, but now avoiding the lookup in the dictionaries
def pair_strs5(ps):
    ps = {p: set(p) for p in ps}
    items = list(ps.items())
    return [(p, q) for i, (p, p_set) in enumerate(ps.items()) for q, q_set in items[i 1:] if not (p_set & q_set)]


# different approach, creating the product (upper half above diagonal) first
def pair_strs6(ps):
    ps = {p: set(p) for p in ps}
    ps = combinations(ps.items(), 2)
    return [(p, q) for (p, p_set), (q, q_set) in ps if not (p_set & q_set)]


# the same as 5, but with integer bitmasks instead of sets, as per @kellybundy's suggestion
def pair_strs7(ps):
    ps = {p: sum(1 << (ord(c) - 97) for c in p) for p in ps}
    items = list(ps.items())
    return [(p, q) for i, (p, p_mask) in enumerate(ps.items()) for q, q_mask in items[i 1:] if not (p_mask & q_mask)]


# run some tests
n = 10
print(timeit(lambda: pair_strs1(strings), number=n))
print(timeit(lambda: pair_strs3(strings), number=n))
print(timeit(lambda: pair_strs4(strings), number=n))
print(timeit(lambda: pair_strs5(strings), number=n))
print(timeit(lambda: pair_strs6(strings), number=n))
print(timeit(lambda: pair_strs7(strings), number=n))


p1 = pair_strs1(strings)
p3 = pair_strs3(strings)
p4 = pair_strs4(strings)
p4  = [tuple(reversed(p)) for p in p4]
p5 = pair_strs5(strings)
p5  = [tuple(reversed(p)) for p in p5]
p6 = pair_strs6(strings)
p6  = [tuple(reversed(p)) for p in p6]
p7 = pair_strs7(strings)
p7  = [tuple(reversed(p)) for p in p7]

print(set(p1) == set(p3) == set(p4) == set(p5) == set(p6) == set(p7) and
      len(p1) == len(p3) == len(p4) == len(p5) == len(p6) == len(p7))

Result for me:

3.4115294000002905
1.7107413000012457
0.8492871999987983
0.7485178000024462
0.7776397999987239
0.37149049999788986
True

So, the final change doesn't really help - computing the product beforehand doesn't save enough. pair_pairs5 appears to be preferable.

Edit: changed the generation of strings to including a size (and set the default to 3 instead of 2), adjusted naming accordingly and avoiding duplicate characters in strings (as per OP's description).

This also makes user @KellyBundy's solution the fastest, since the lack of duplicates allows for a very clean integer-based solution, #7. This is the clear winner under these conditions (about a 9x speed-up).

CodePudding user response：

I think that you could track the letter set for each array element. Then, store the reversed pair(e.g., (x,i)) not to visit the entry again as follows:

if __name__ == '__main__':
    arr4 = ['ab','cd','ac','de']
    string_dict = {}
    for a_str in arr4:
        if a_str not in string_dict:
            string_dict[a_str] = set()
        for a_letter in a_str:
            string_dict[a_str].add(a_letter)
    print(string_dict)
    visited = set()
    for i in arr4:
        for x in arr4:
            if not (string_dict[i] & string_dict[x]):
                cur_pair = (i, x)
                if cur_pair in visited:
                    continue
                print(cur_pair)
                reverse_pair = (x, i)
                visited.add(cur_pair)
                visited.add(reverse_pair)

Result:

{'ab': {'a', 'b'}, 'cd': {'c', 'd'}, 'ac': {'a', 'c'}, 'de': {'d', 'e'}}
('ab', 'cd')
('ab', 'de')
('ac', 'de')