Efficient way to check if every string in a list appear in another list of strings


I have 2 massive lists of strings:

One that holds the names of chemicals (around 10K of them):

chemicals_list = ["chemical1", "chemical2", ..., "chemical100000"]

And another that contains article abstracts (around 50M of them):

abstracts_list = ["abstract1 is very very very long", "abstract2 is very very very VERY long", ..., "abstract50000000 is pretty long as well"]

I need to create a frequency dictionary that maps each chemical in chemicals_list to the number of abstracts it appears in.

Currently I have 2 for loops, but this is taking forever:

frequency_dict = {}
for c in chemicals_list:
    exact_entity = f' {c} '  # make sure it's the exact entity, since it can also appear as a substring (e.g., "pen" inside "penicillin")
    for abstract_text in abstracts_list:
        if exact_entity in abstract_text:
            if c in frequency_dict:
                frequency_dict[c] += 1
            else:
                frequency_dict[c] = 1

Is there a more efficient way to do this? I have access to a GPU, if that helps.

CodePudding user response:

I optimized your code in the following way:

chemicals_list = ["chemical1", "chemical2", ..., "chemical100000"]

abstracts_list = ["abstract1 is very very very long", "abstract2 is very very very VERY long", ..., "abstract50000000 is pretty long as well"]

frequency_dict = {}
text = ' '.join(abstracts_list)  # make one big string
for c in chemicals_list:
    # you might want to consider fuzzy word matching (see the fuzzywuzzy python lib)
    frequency_dict[c] = text.count(f' {c} ')  # pad with spaces to count only the exact entity

I ran a test locally and saw a massive speed improvement in my test case. It's a good idea to avoid Python-level loops if you want performance. There might even be a function that removes the need for any Python loop, but I didn't really search. Also try using numpy/scipy where you can, so you benefit from precompiled C functions. When you have tried all of that, you can start thinking about multithreading or multiprocessing.
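
For example, here is a rough sketch (untested, and not part of the original answer) of how the per-chemical counting could be split across worker processes with the standard multiprocessing module; it assumes chemicals_list and abstracts_list are defined at module level as above:

from multiprocessing import Pool

text = ' '.join(abstracts_list)  # one big string, as above

def count_chemical(c):
    # count whole-word occurrences of a single chemical in the joined text
    return c, text.count(f' {c} ')

if __name__ == '__main__':
    with Pool() as pool:
        # each worker process handles a chunk of chemicals_list
        frequency_dict = dict(pool.map(count_chemical, chemicals_list))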

Also consider posting this on https://codereview.stackexchange.com/ where it's more appropriate to ask for a review/improvement.

CodePudding user response:

You could use collections.Counter along with a generator expression:


class collections.Counter([iterable-or-mapping])

A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
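
For illustration (this example is not part of the original answer), a Counter simply tallies the items of any iterable:

from collections import Counter

word_counts = Counter(['pen', 'ink', 'pen'])
print(word_counts)         # Counter({'pen': 2, 'ink': 1})
print(word_counts['pen'])  # 2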


from collections import Counter

# yield the chemical once for every abstract that contains it as an exact entity
matched_chemicals = (c
                     for c in chemicals_list
                     for abstract_text in abstracts_list
                     if f' {c} ' in abstract_text)
freq_counted = Counter(matched_chemicals)  # chemical -> number of abstracts it appears in
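
As a further sketch (untested, and assuming every chemical name is a single whitespace-delimited token), you could also split each abstract once and intersect its word set with a set of the chemicals, which avoids scanning every abstract once per chemical:

from collections import Counter

chemical_set = set(chemicals_list)
frequency_dict = Counter()
for abstract_text in abstracts_list:
    tokens = set(abstract_text.split())           # unique words in this abstract
    frequency_dict.update(tokens & chemical_set)  # count each matching chemical once per abstract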