My code executes for a small sample but not for a large one

I am trying to count the frequency of word occurrences in a variable. The variable contains more than 700,000 observations. The output should be a dictionary with the words that occurred the most. I used the code below to do this:

d1 = {}
for i in range(len(words)-1):
    x=words[i]
    c=0
    for j in range(i,len(words)):
        c=words.count(x)
    count=dict({x:c})
    if x not in d1.keys():
        d1.update(count)

I ran the code on the first 1,000 observations and it worked perfectly. The output is shown below:

[('semantic', 23),
 ('representations', 11),
 ('models', 10),
 ('task', 10),
 ('data', 9),
 ('parser', 9),
 ('language', 8),
 ('languages', 8),
 ('paper', 8),
 ('meaning', 8),
 ('rules', 8),
 ('results', 7),
 ('performance', 7),
 ('parsing', 7),
 ('systems', 7),
 ('neural', 6),
 ('tasks', 6),
 ('entailment', 6),
 ('generic', 6),
 ('te', 6),
 ('natural', 5),
 ('method', 5),
 ('approaches', 5)]

When I try to run it on 100,000 observations, it keeps running. I have let it run for more than 24 hours and it still has not finished. Does anyone have an idea?

CodePudding user response：

You can use collections.Counter.

from collections import Counter

counts = Counter(words)
print(counts.most_common(20))
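
As a small self-contained illustration (the sample list below is made up and only stands in for the asker's data), Counter builds all the counts in a single pass over the list, and most_common(n) returns the n highest counts as (word, count) pairs:

```python
from collections import Counter

# Hypothetical sample data standing in for the asker's `words` list
words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']

counts = Counter(words)           # one pass over the list
top_word = counts.most_common(1)  # list of (word, count) pairs, highest first

print(top_word)        # [('dog', 3)]
print(counts['able'])  # 2
```

A Counter is a dict subclass, so it can be used anywhere the question expects a dictionary of frequencies.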

CodePudding user response：

@Jon's answer is the best in your case. However, in some cases collections.Counter can be slower than plain iteration (especially if you don't need to sort by frequency afterwards), as I asked in this question.

You can count frequencies by iteration.

d1 = {}
for item in words:
  if item in d1:
    d1[item] += 1
  else:
    d1[item] = 1

# finally sort the dictionary of frequencies
print(dict(sorted(d1.items(), key=lambda item: item[1])))

But again, for your case, using @Jon's answer is faster and more compact.
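
If only the most frequent words are wanted from a plain dictionary of counts, one option is to sort in descending order and keep the first few pairs. This is a minimal sketch; the helper name top_n and the sample dictionary are made up for illustration:

```python
def top_n(freqs, n):
    """Return the n (word, count) pairs with the highest counts, highest first."""
    # sorted() is stable, so words with equal counts keep their dictionary order
    return sorted(freqs.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical frequency dictionary, as the counting loop would produce
d1 = {'able': 2, 'baker': 1, 'charlie': 2, 'dog': 3, 'easy': 1}

print(top_n(d1, 1))  # [('dog', 3)]
```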

CodePudding user response：

#...
for i in range(len(words)-1):
    #...
    #...
    for j in range(i,len(words)):
        c=words.count(x)
    #...
    if x not in d1.keys():
        #...

I've tried to highlight the problems in your code above. In English, it does something like this:

"Count the occurrences of the word I'm looking at in the whole list, and repeat that same count once for every word that comes after it, for every word in the whole list. Also, look through the whole dictionary I'm building again for every word in the list, while I'm building it."

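This is way more work than you need to do; you only need to look at each word in the list once. You do need to look in the dictionary once for every word, but that lookup is cheap when you test the dictionary directly. In Python 2, looking at d1.keys() makes this far slower by converting the dictionary's keys to another list and looking through the whole thing. In Python 3, d1.keys() returns a view and the membership test stays fast, so there the real cost is the repeated words.count(x) inside the nested loops. The following code will do what you want, much more quickly: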

words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']

word_counts = {}

# Look at each word in our list once
for word in words:
    # If we haven't seen it before, create a new count in our dictionary
    if word not in word_counts:
        word_counts[word] = 0

    # We've made sure our count exists, so just increment it by 1
    word_counts[word] += 1

print(list(word_counts.items()))

The above example will give (the order of the pairs may differ depending on your Python version):

[
    ('charlie', 2),
    ('baker', 1),
    ('able', 2),
    ('dog', 3),
    ('easy', 1)
]
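
To put the cost of the nested loops from the question in numbers, here is a rough sketch that only does the arithmetic, without running the slow code. The function names are made up, and it assumes each words.count(x) call scans the entire list:

```python
def count_calls(n):
    """Number of words.count(x) calls the question's nested loops make for n words."""
    # Outer loop: i in range(n - 1); inner loop: j in range(i, n) -> n - i calls
    return sum(n - i for i in range(n - 1))


def elements_scanned(n):
    """Total list elements visited, assuming each count() call scans all n words."""
    return count_calls(n) * n


for n in (1_000, 100_000):
    print(f"{n:>7} words: {count_calls(n):,} count() calls, "
          f"about {elements_scanned(n):.1e} elements scanned")
```

For 1,000 words this comes to 500,499 calls and about 5.0e+08 elements scanned; for 100,000 words it is 5,000,049,999 calls and about 5.0e+14 elements. So 100 times more data means roughly a million times more work, which is why the run does not finish, whereas a single pass needs only 100,000 dictionary lookups.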