I have 2 massive lists of strings:
One that has names of chemicals (like 10K chemicals):
chemicals_list = ["chemical1", "chemical2", ..., "chemical100000"]
And another that contains article abstracts (like 50M abstracts):
abstracts_list = ["abstract1 is very very very long", "abstract2 is very very very VERY long", ..., "abstract50000000 is pretty long as well"]
I need to create a frequency dictionary that maps each chemical from chemicals_list to the number of abstracts it appears in.
Currently I have 2 for loops, but this is taking forever:
frequency_dict = {}
for c in chemicals_list:
    exact_entity = f' {c} '  # make sure it's the exact entity, since it can also appear as a substring (e.g., "pen" rather than "penicillin")
    for abstract_text in abstracts_list:
        if exact_entity in abstract_text:
            if c in frequency_dict:
                frequency_dict[c] += 1
            else:
                frequency_dict[c] = 1
Is there a more efficient way to do this? I have access to a GPU, if that helps.
CodePudding user response:
I optimized your code in the following way:
chemicals_list = ["chemical1", "chemical2", ..., "chemical100000"]
abstracts_list = ["abstract1 is very very very long", "abstract2 is very very very VERY long", ..., "abstract50000000 is pretty long as well"]
frequency_dict = {}
text = ' '.join(abstracts_list)  # make one big string
for c in chemicals_list:
    # note: this counts every occurrence of c as a substring of the joined text
    # you might want to consider fuzzy word matching (see the fuzzywuzzy python lib)
    frequency_dict[c] = text.count(c)
I ran a test locally and saw a massive speed improvement in my test case. It's a good idea to avoid Python loops if you want performance. There may even be a function that removes the remaining loop entirely, but I didn't search for one. Also try using numpy/scipy where you can, so the heavy lifting happens in precompiled C functions. Once you have tried all of that, you can start thinking about multithreading.
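If you do reach the point of parallelising, here is a minimal sketch of the idea using multiprocessing rather than threads (threads won't speed up this CPU-bound pure-Python loop because of the GIL). The names count_chunk, parallel_frequency, n_workers and chunk_size are illustrative choices of mine, not part of your code, and the sketch assumes chemicals_list is defined at module level so the worker processes can see it.

from collections import Counter
from multiprocessing import Pool

def count_chunk(abstracts_chunk):
    # within one chunk, count how many abstracts contain each chemical
    counts = Counter()
    for abstract_text in abstracts_chunk:
        for c in chemicals_list:
            if f' {c} ' in abstract_text:
                counts[c] += 1
    return counts

def parallel_frequency(abstracts, n_workers=8, chunk_size=100_000):
    # split the abstracts into chunks, count each chunk in its own process,
    # then merge the partial Counters into one frequency dictionary
    chunks = [abstracts[i:i + chunk_size] for i in range(0, len(abstracts), chunk_size)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    total = Counter()
    for part in partial_counts:
        total.update(part)
    return dict(total)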
Also consider posting this on https://codereview.stackexchange.com/ where it's more appropriate to ask for a review/improvement.
CodePudding user response:
You could use collections.Counter along with a generator expression:
class collections.Counter([iterable-or-mapping])
A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.
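As a quick illustration of that documented behaviour (the chemical names here are just made-up examples), a Counter built from any iterable tallies how many times each element occurs:

from collections import Counter

Counter(['aspirin', 'penicillin', 'aspirin'])
# Counter({'aspirin': 2, 'penicillin': 1})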
from collections import Counter

exact_entities = map(lambda x: f' {x} ', chemicals_list)

# yield the chemical (not the abstract) once per abstract that contains it,
# so the Counter tallies the number of matching abstracts per chemical
freq_counted = Counter(exact_entity.strip()
                       for exact_entity in exact_entities
                       for abstract_text in abstracts_list
                       if exact_entity in abstract_text)
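Note that freq_counted will only contain chemicals that matched at least one abstract; since a Counter returns 0 for missing keys, looking up any of the remaining chemicals still behaves like the frequency dictionary you wanted.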