I'm working with a dictionary, trying to find all values (repetitions of words in a text) above 1 and store them in a list with this function:
import nltk

def get_repetitions(text):
    n_grams_lengths = [1, 2, 3, 4, 5, 6]
    ngrams_count = {}
    for n in n_grams_lengths:
        ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
        ngrams_count.update({' '.join(i): ngrams.count(i) for i in ngrams})
    reps_list = []
    reps_variables = {values for (values) in ngrams_count.values() if values > 1}
    reps_list.append(reps_variables)
    return reps_list
When I do this, however, I get the list of values found in the dictionary, but not how many times they appear. How would I go about getting this?
Also, say the value "2" is in the dictionary 3 times, and the value "5", 4 times, would there be a way of getting something like this: 2,2,2,5,5,5,5?
CodePudding user response:
If text is a str containing some text, then:
text=text.split()
result={i:text.count(i) for i in text if text.count(i)>1}
However, str.split() with no arguments splits on any run of whitespace and nothing else, so punctuation stays attached to the words ("cat" and "cat." are counted separately). Depending on the text, the counts may not be as accurate as one would hope.
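One way to reduce that inaccuracy is to normalise case and strip punctuation before counting. The following is only a sketch under assumptions: the helper name repeated_words and the regular expression are illustrative choices, not part of the answer above, and collections.Counter is swapped in for the repeated text.count(i) calls.

```python
import re
from collections import Counter

def repeated_words(text):
    """Return {word: count} for words occurring more than once (hypothetical helper)."""
    # Lowercase, then keep only runs of letters, digits, and apostrophes,
    # so "Cat," and "cat" are counted as the same word.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > 1}

example = "The cat saw the other cat. The end."
print(repeated_words(example))  # {'the': 3, 'cat': 2}
```

Counter walks the word list once, whereas calling count() for every word rescans the list each time; on long texts that difference can matter.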
If you have a dictionary with words as keys and numbers of their occurrences as values, the solution to the second question can be done as follows:
result=' '.join(word for word in dictionary for _ in range(dictionary[word]))
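The same expansion can probably also be written with collections.Counter, whose elements() method repeats each key according to its count. A minimal sketch, assuming a small made-up dictionary (the sample data is not from the question):

```python
from collections import Counter

# Made-up example data: words mapped to their occurrence counts.
dictionary = {"car": 3, "foo": 2}

# Counter.elements() yields each key repeated as many times as its count.
expanded = list(Counter(dictionary).elements())
result = " ".join(expanded)
print(result)  # car car car foo foo
```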
CodePudding user response:
Your issue is that you already have a dictionary mapping each n-gram to its frequency, but you only extract the frequencies into a set, which drops the n-grams and collapses duplicate values. Instead of doing that, you just need to filter ngrams_count:
ngrams_count = {"car": 5, "bob": 1, "foo": 3}
reps_variables = dict(filter(lambda elem: elem[1] > 1, ngrams_count.items()))
reps_variables
>>> {"car": 5, "foo": 3}
Then, for the second part of your question, we can do this:
import itertools
# chain() returns a lazy iterator, so wrap it in list() to see the items
frequencies = list(itertools.chain(*[[k] * v for k, v in reps_variables.items()]))
frequencies
>>> ["car", "car", "car", "car", "car", "foo", "foo", "foo"]
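The question's example output (2,2,2,5,5,5,5) can also be read as asking for the counts themselves with duplicates kept. The original function loses them because it collects the values in a set. A minimal sketch of that reading, assuming ngrams_count has the same shape as in the question (the sample data below is made up):

```python
from collections import Counter

# Made-up sample data in the same shape as ngrams_count from the question.
ngrams_count = {"a b": 2, "c d": 5, "e": 2, "f g": 1, "h": 5, "i j": 2}

# A list (not a set) keeps duplicate counts; sorting groups equal values together.
reps_list = sorted(v for v in ngrams_count.values() if v > 1)
print(reps_list)  # [2, 2, 2, 5, 5]

# A Counter over those values shows how many n-grams share each count.
value_frequencies = Counter(reps_list)
print(value_frequencies)  # Counter({2: 3, 5: 2})
```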