I have a question. I need to split a word into n-grams (for example, the word ADVENTURE has three 4-grams: ADVE, ENTU, TURE). The real input is a book file (that's the reason for the counter and isalpha), which I don't have here, so I'm using only a list of 2 words. This is my code in Python:
words = ['adven', 'adventure']

def ngrams(words, n):
    counter = {}
    for word in words:
        if (len(word)-1) >= n:
            for i in range(0, len(word)):
                if word.isalpha() == True:
                    ngram = ""
                    for i in range(len(word)):
                        ngram = word[i:n:]
                        if len(ngram) == n:
                            ngram.join(counter)
                            counter[ngram] = counter.get(ngram, 0) + 1
    return counter

print(ngrams(words, 4))
This is what the code gives me:
{'adve': 14}
I don't care about the values in it, but I'm not very good with strings and I don't know what to change so that it gives me the three 4-grams. I tried "ngram = word[i::]" but that gives me None. Please help; this is school homework, and I can't write the remaining functions until this ngrams function works.
CodePudding user response:
Use nltk.ngrams for this job:
from nltk import ngrams
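Note that nltk.ngrams produces conventional sliding-window n-grams (every window of n items, moving one position at a time), which is not the one-character-overlap scheme described in the question. The sketch below imitates that conventional behavior in plain Python so the difference is visible without installing anything. The name `sliding_ngrams` is hypothetical, and unlike nltk, which yields tuples of items, this version returns joined strings.

```python
def sliding_ngrams(word, n):
    """Return every contiguous substring of length n, one step at a time.

    Illustrative helper only; it is not part of nltk.
    """
    # range() is empty when the word is shorter than n, so the result is [].
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(sliding_ngrams("adventure", 4))
# ['adve', 'dven', 'vent', 'entu', 'ntur', 'ture']
```

For ADVENTURE this gives six 4-grams rather than the three the question asks for.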
CodePudding user response:
I think your definition of n-grams is a little different from the conventional one, as pointed out by @Stuart in his comment. However, with the definition from your comment, I think the following would solve your problem.
def n_grams(word, n):
    # We can't find n-grams if the word has less than n letters.
    if n > len(word):
        return []
    output = []
    start_idx = 0
    end_idx = start_idx + n
    # Grab all n-grams except the last one
    while end_idx < len(word):
        n_gram = word[start_idx:end_idx]
        output.append(n_gram)
        start_idx = end_idx - 1
        end_idx = start_idx + n
    # Grab the last n-gram
    last_n_gram_start = len(word) - n
    last_n_gram_end = len(word)
    output.append(word[last_n_gram_start:last_n_gram_end])
    return output
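Tracking an explicit start and end index matters because a Python slice takes a start position and an end position, not a start and a length. That is the likely reason the code in the question only ever counted 'adve': `word[i:n]` keeps the end fixed at n, so the slice has length n only when i is 0. A small sketch of the difference (the variable names `shrinking` and `fixed_width` are just for illustration):

```python
word = "adventure"
n = 4

# The second slice argument is an END index, so the window shrinks as i grows
# and becomes empty once i reaches n.
shrinking = [word[i:n] for i in range(len(word))]
print(shrinking)

# Using i + n as the end index keeps every window exactly n characters wide.
fixed_width = [word[i:i + n] for i in range(len(word) - n + 1)]
print(fixed_width)
```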
CodePudding user response:
If I've understood the rules correctly, you can do it like this:
def special_ngrams(word, n):
    """ Yield character ngrams of word that overlap by only one character,
    except for the last two ngrams which may overlap by more than one
    character. The first and last ngrams of the word are always included. """
    for start in range(0, len(word) - n, n - 1):
        yield word[start:start + n]
    yield word[-n:]

for word in "hello there this is a test", "adventure", "tyrannosaurus", "advent":
    print(list(special_ngrams(word, 4)))
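To get back to the counter from the question, one possible approach is to feed the n-grams of each word into a `collections.Counter`. This is only a sketch under some assumptions: the names `overlap_ngrams` and `count_ngrams` are made up for the example, words that are shorter than n or not purely alphabetic are skipped, and n is required to be at least 2 because a step of `n - 1 == 0` would make `range` raise an error.

```python
from collections import Counter


def overlap_ngrams(word, n):
    """n-grams overlapping by one character, plus the last n characters."""
    # Assumption: words shorter than n contribute nothing; n < 2 is rejected
    # because range() cannot use a step of 0.
    if n < 2 or len(word) < n:
        return []
    grams = [word[i:i + n] for i in range(0, len(word) - n, n - 1)]
    grams.append(word[-n:])
    return grams


def count_ngrams(words, n):
    """Count n-grams over all purely alphabetic words."""
    counter = Counter()
    for word in words:
        if word.isalpha():
            counter.update(overlap_ngrams(word, n))
    return counter


print(count_ngrams(["adven", "adventure"], 4))
```

With the two sample words, 'adve' is counted twice (once per word) and 'dven', 'entu' and 'ture' once each. For a book file, the list of words could come from splitting each line on whitespace before calling `count_ngrams`.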