Set comparison optimization

Time:01-10

Description

I have two large lists of sets:

A = [ {...}, ..., {...} ]
B = [ {...}, ..., {...} ]

I'm running a very expensive list comprehension: for each set in A, it checks whether any of its elements also appears in one of B's sets and, if so, returns those sets from B.

[find_sets(i) for i in A]
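
A tiny hand-made case shows the intended behaviour. The literal sets and the helper name `find_matching` below are made up for illustration and are not part of the real data:

```python
# Illustrative data only; the real lists hold hundreds of thousands of sets.
A_small = [{"a", "b"}, {"x"}]
B_small = [{"b", "c"}, {"d"}, {"a", "z"}]

def find_matching(a_set, candidates):
    # Keep every candidate set that shares at least one element with a_set.
    return [b_set for b_set in candidates if a_set & b_set]

result_small = [find_matching(a_set, B_small) for a_set in A_small]
print(result_small)
```

Here the first set in `A_small` matches the first and third sets of `B_small`, and the second set matches none, so its result is an empty list.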

Example

A minimal example looks like this:

import secrets

# create sample data
def generate_random_strings(num_strings, string_length):
    # token_hex(n) returns 2*n hex characters; duplicates collapse in the set,
    # so the result may hold fewer than num_strings elements
    return {secrets.token_hex(string_length) for _ in range(num_strings)}

A = [generate_random_strings(5, 1) for i in range(10000)]
B = [generate_random_strings(5, 1) for i in range(10000)]

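# set checker: return every set in B that shares an element with a_set
def find_sets(a_set):
    matching_sets = []
    for b_set in B:
        # a non-empty intersection means at least one common element
        if a_set & b_set:
            matching_sets.append(b_set)
    return matching_sets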

result = [find_sets(i) for i in A]

Multiprocessing

It's obviously faster on all my 32 CPU cores:

from tqdm.contrib.concurrent import process_map

# process_map manages its own worker pool; max_workers sets the process count
results = process_map(find_sets, A, max_workers=32, chunksize=100)

Problem

For a few thousand elements in A and B the list comprehension runs fairly fast on my machine, and multiprocessing lets me scale to about 50,000 elements. At my actual size of 500,000 elements in each list, however, it becomes very slow.

Is there any way to speed up my function in code, for example with vectorization, by hashing the sets beforehand, or by using an optimized data type (frozensets didn't help)?

CodePudding user response:

This runs an order of magnitude faster in my tests:

import collections

reverse_map = collections.defaultdict(set)
for idx, elements in enumerate(B):
    for element in elements:
        reverse_map[element].add(idx)

def find_sets(A):
    union = set()
    emptyset = set()
    for element in A:
        union |= reverse_map.get(element, emptyset)
    return [B[idx] for idx in union]
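
The idea here is an inverted index: each element maps to the indices of the sets in B that contain it, so a lookup only touches the sets that can match instead of scanning all of B. How much that helps depends on how many sets share each element. Below is a minimal sketch that checks the indexed lookup against the brute-force one on small random data. The integer data and the helper names (`build_reverse_map`, `find_indices_indexed`, `find_indices_bruteforce`) are assumptions for illustration. The indexed version yields matches in arbitrary order, so the comparison is done on sets of indices:

```python
import collections
import random

def build_reverse_map(sets):
    # Map each element to the indices of the sets that contain it.
    reverse_map = collections.defaultdict(set)
    for idx, elements in enumerate(sets):
        for element in elements:
            reverse_map[element].add(idx)
    return reverse_map

def find_indices_indexed(a_set, reverse_map):
    # Union of the index sets for every element of a_set.
    matches = set()
    for element in a_set:
        # .get() avoids creating empty entries in the defaultdict.
        matches |= reverse_map.get(element, set())
    return matches

def find_indices_bruteforce(a_set, sets):
    # Reference implementation: test a_set against every set.
    return {idx for idx, b_set in enumerate(sets) if a_set & b_set}

# Hypothetical demo data: 200 sets of 5 integers drawn from 0..49.
rng = random.Random(0)

def random_sets(count):
    return [set(rng.sample(range(50), 5)) for _ in range(count)]

A_demo = random_sets(200)
B_demo = random_sets(200)

reverse_map_demo = build_reverse_map(B_demo)
same = all(
    find_indices_indexed(a_set, reverse_map_demo)
    == find_indices_bruteforce(a_set, B_demo)
    for a_set in A_demo
)
print(same)
```

If the original order of B matters in the output, sorting the matched indices before mapping them back to sets restores it.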

CodePudding user response:

You could use a comprehension instead of find_sets:

result = [[b for b in B if a & b] for a in A]