Improve speed of Python algorithm


I have used the Sentiment140 Twitter dataset for sentiment analysis.

Code:

getting words from tweets:

tweet_tokens = []
[tweet_tokens.append(dev.get_tweet_tokens(idx)) for idx, item in enumerate(dev)]

getting unknown words from the tokens:

words_without_embs = []
[[words_without_embs.append(w) for w in tweet if w not in word2vec] for tweet in tweet_tokens]
len(words_without_embs)

last part of the code: calculate each vector as the mean of the left and right words (context)

vectors = {} # alg
for word in words_without_embs:
  mean_vectors = []
  for tweet in tweet_tokens:
    if word in tweet:
      idx = tweet.index(word)
      try:
        mean_vector = np.mean([word2vec.get_vector(tweet[idx-1]), word2vec.get_vector(tweet[idx+1])], axis=0)
        mean_vectors.append(mean_vector)
      except:
        pass

    if tweet == tweet_tokens[-1]: # last iteration
      mean_vector_all_tweets = np.mean(mean_vectors, axis=0)
      vectors[word] = mean_vector_all_tweets

There are 1,058,532 words, and the last part of this code runs very slowly, at about 250 words per minute.

How can I improve the speed of this algorithm?

CodePudding user response:

One of the main reasons your code is slow is that it checks the existence of every word (nearly 1 million words) against each tweet in tweet_tokens. Hence, the time complexity of your implementation is on the order of 1e6 * |tweet_tokens|.

1) First Improvement (Reducing Search and Comparisons)

You can do much better by tokenizing each tweet first and then looking up the index of each word. If you build a dictionary over the existing words, you can find a word token's index with at most log(1e6) ~ 25 comparisons in that dictionary. In that case, the time complexity is at most 25 * |tweet_tokens|, so this part of the code can become roughly 1e6 / 25 = 40,000 times faster!
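One way to read this suggestion in Python is to put the unknown words into a hash-based structure and make a single indexing pass over the tweets. The sketch below is only an illustration of that idea, reusing the question's tweet_tokens and words_without_embs; the occurrences index is my own name, not part of the original code:

from collections import defaultdict

oov_words = set(words_without_embs)         # O(1) membership tests instead of scanning a list

# One pass over all tweets: record where each unknown word occurs,
# instead of re-scanning every tweet for every single word later.
occurrences = defaultdict(list)             # word -> [(tweet index, token index), ...]
for t_idx, tweet in enumerate(tweet_tokens):
    for w_idx, token in enumerate(tweet):
        if token in oov_words:
            occurrences[token].append((t_idx, w_idx))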

2) Second Improvement (Reducing Word2Vec Computations)

Moreover, you are recomputing the vector of the same word across different tweets, so each word's vector is computed f times, where f is the word's frequency in the tweets. A more sensible solution is to compute the vectors for all words in words_without_embs once (this can be an offline step), store them indexed by each word's position in the word dictionary (or in some other structure that lets you find them quickly by word), and then simply read them from that prepared structure when averaging. In that case, on top of the 40,000x improvement, you gain a further factor equal to the sum of all the words' frequencies in the tweets.
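Assuming word2vec is a gensim-style KeyedVectors object (as the question implies), one simple reading of this advice is to cache every neighbor word's vector the first time it is fetched, so repeated occurrences become plain dict reads. The names vector_cache and cached_vector below are my own, not part of the question's code:

vector_cache = {}  # neighbor word -> embedding, filled lazily

def cached_vector(word):
    # Fetch each word's vector from word2vec only once; later calls are dict reads.
    if word not in vector_cache:
        vector_cache[word] = word2vec.get_vector(word)
    return vector_cache[word]

# inside the question's loop, the averaging line then becomes:
# mean_vector = np.mean([cached_vector(tweet[idx-1]), cached_vector(tweet[idx+1])], axis=0)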

CodePudding user response:

This looks like it would parallelize well with some minor edits.

  • Can you move that last if tweet == tweet_tokens[-1]: block up a level and remove the "if" statement for it? That by itself would only give a minor speed increase, but with it in place, you could parallelize the code more effectively.
  • Consider making the inner for loop its own function and calling it via a multithreaded setup (see concurrent.futures.ThreadPoolExecutor for a description and examples, and the sketch after this list). You could then have a separate thread for each processor on a single machine, or for each VM in a distributed environment. This should allow more effective scaling.
  • Building on @DarrylG's comment above, restructure those list comprehensions so they don't call .append() on an existing list. The .append() pattern is slower than an equivalent list comprehension. For example, change [tweet_tokens.append(dev.get_tweet_tokens(idx)) for idx, item in enumerate(dev)] to tweet_tokens = [dev.get_tweet_tokens(idx) for idx, item in enumerate(dev)].
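A minimal sketch of the threading idea, assuming tweet_tokens, words_without_embs, and word2vec from the question are in scope; vector_for_word and the worker count are my own choices, and for CPU-bound NumPy work a ProcessPoolExecutor may scale better than threads because of the GIL:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def vector_for_word(word):
    # The question's inner loop, pulled into a function so it can be mapped over words.
    mean_vectors = []
    for tweet in tweet_tokens:
        if word in tweet:
            idx = tweet.index(word)
            try:
                mean_vectors.append(np.mean([word2vec.get_vector(tweet[idx - 1]),
                                             word2vec.get_vector(tweet[idx + 1])], axis=0))
            except (KeyError, IndexError):
                pass
    return word, np.mean(mean_vectors, axis=0) if mean_vectors else None

with ThreadPoolExecutor(max_workers=8) as pool:
    vectors = dict(pool.map(vector_for_word, words_without_embs))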

CodePudding user response:

More common (and probably better) strategies for dealing with unknown words include:

  • training/using a model, like FastText, that can offer guess-vectors for out-of-vocabulary (OOV) words
  • acquiring more training data, so vectors for more unknown words can be learned from real usages
  • ignoring unknown words entirely

It seems you've decided instead to synthesize new vectors for OOV words by averaging all of their immediate neighbors. I don't think this will work especially well. In many downstream uses of the word vectors, it just tends to overweight the word's in-context neighbors, which can also be achieved far more simply and cheaply by ignoring the unknown word entirely.

But given what you want to do, the best approach would be to collect the neighboring words during the same pass that identifies the words_without_embs.

For example, make words_without_embs a dict (or perhaps a collections.defaultdict), where each key is a word that will need a vector and each value is a list of all the neighboring words found so far.

Then, a single loop over tweet_tokens would both fill words_without_embs with keys (the words needing vectors) and stuff those values with all the neighboring words seen so far.

Then, one last loop over the words_without_embs keys would simply grab the existing lists of neighbor-words for the averaging. (No more multiple passes over tweet_tokens.)
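A rough sketch of that single-pass idea, reusing the question's tweet_tokens, word2vec, and vectors names and turning words_without_embs into the suggested defaultdict; treat it as an illustration rather than a drop-in replacement:

from collections import defaultdict
import numpy as np

# Single pass: for every OOV word, remember its in-vocabulary neighbors.
words_without_embs = defaultdict(list)      # OOV word -> list of neighboring words
for tweet in tweet_tokens:
    for i, w in enumerate(tweet):
        if w not in word2vec:
            for j in (i - 1, i + 1):        # left and right neighbors
                if 0 <= j < len(tweet) and tweet[j] in word2vec:
                    words_without_embs[w].append(tweet[j])

# One last loop over the keys: average the neighbors' vectors (no more passes over tweet_tokens).
vectors = {
    word: np.mean([word2vec.get_vector(n) for n in neighbors], axis=0)
    for word, neighbors in words_without_embs.items()
    if neighbors
}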

But again: all this work might not outperform the baseline practice of simply dropping unknown words.
