How to split word to ngrams in Python?

Time:12-19

I have this problem: I need to split a word into n-grams (for example, the word ADVENTURE has three 4-grams: ADVE, ENTU, TURE). The real input is a book file (that's the reason for the counter and isalpha), which I don't have here, so I'm using only a list of 2 words. This is my code in Python:

words = ['adven', 'adventure']
def ngrams(words, n):
    counter = {} 
    for word in words:
        if (len(word)-1) >= n:
            for i in range(0, len(word)):
                if word.isalpha() == True:
                    ngram = ""
                    for i in range(len(word)):
                            ngram  = word[i:n:]
                            if len(ngram) == n:
                                ngram.join(counter)
                                counter[ngram] = counter.get(ngram, 0) + 1
    return counter

print(ngrams(words, 4))

This is what the code gives me:
{'adve': 14}

I don't care about the values, but I'm not very good with strings and I don't know what to change so that it gives me the three 4-grams. I tried "ngram = word[i::]", but that gives me None. Please help: this is my school homework, and I can't write the other functions until this ngrams function works.

CodePudding user response:

Use nltk.ngrams for this job:

from nltk import ngrams
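
For comparison, here is a minimal sketch of what the conventional definition gives for the characters of a single word, written without nltk. The name sliding_ngrams is made up for this illustration. Note that nltk.ngrams itself yields tuples of items rather than strings, and that this conventional result has more n-grams than the three the question asks for.

```python
def sliding_ngrams(word, n):
    """Return every contiguous substring of length n (conventional character n-grams)."""
    # The slice end has to be i + n. Writing word[i:n] keeps the end fixed at n,
    # so the slice shrinks as i grows and only the first slice has length n.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(sliding_ngrams("adventure", 4))
# ['adve', 'dven', 'vent', 'entu', 'ntur', 'ture']
```

The comment in the sketch is also the likely reason the code in the question only ever finds 'adve'.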

CodePudding user response:

I think your definition of n-grams is a little different from the conventional one, as pointed out by @Stuart in his comment. However, with the definition from your comment, I think the following would solve your problem.

def n_grams(word, n):

    # We can't find n-grams if the word has less than n letters.
    if n > len(word):
        return []

    output = []
    start_idx = 0
    end_idx = start_idx + n

    # Grab all n-grams except the last one
    while end_idx < len(word):
        n_gram = word[start_idx:end_idx]
        output.append(n_gram)
        start_idx = end_idx - 1
        end_idx = start_idx + n

    # Grab the last n-gram
    last_n_gram_start = len(word) - n
    last_n_gram_end = len(word)
    output.append(word[last_n_gram_start:last_n_gram_end])

    return output
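
To plug this into the counting from the question (a list of words, skipping anything that is not purely alphabetic), something like the following might work. It is only a sketch: count_ngrams is a name I made up, the n_grams helper is a condensed restatement of the same rule so the snippet runs on its own, and the n < 2 guard is my addition because an overlap of one character makes no sense for n = 1.

```python
from collections import Counter

def n_grams(word, n):
    # Condensed restatement of the overlap-by-one rule: starts advance by n - 1,
    # and the final n-gram is always the last n characters of the word.
    if n < 2 or n > len(word):
        return []
    grams = [word[i:i + n] for i in range(0, len(word) - n, n - 1)]
    return grams + [word[-n:]]

def count_ngrams(words, n):
    """Count n-grams over all purely alphabetic words (hypothetical helper)."""
    counter = Counter()
    for word in words:
        if word.isalpha():  # skip tokens containing digits or punctuation
            counter.update(n_grams(word, n))
    return counter

print(count_ngrams(['adven', 'adventure'], 4))
```

With the two words from the question, 'adve' should be counted twice and 'dven', 'entu' and 'ture' once each.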

CodePudding user response:

If I've understood the rules correctly, you can do it like this:

def special_ngrams(word, n):
    """ Yield character ngrams of word that overlap by only one character, 
        except for the last two ngrams which may overlap by more than one 
        character. The first and last ngrams of the word are always included. """
    for start in range(0, len(word) - n, n - 1):
        yield word[start:start + n]
    yield word[-n:]

for word in "hello there this is a test", "adventure", "tyrannosaurus", "advent":
    print(list(special_ngrams(word, 4)))
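
Two edge cases are worth knowing about with this generator: for a word shorter than n, the final word[-n:] yields the whole (too short) word, and for n = 1 the range step n - 1 is zero, which raises a ValueError. A possible guard is sketched below; checked_special_ngrams is a hypothetical wrapper name, and returning an empty list for short words is an assumption about what the homework wants.

```python
def special_ngrams(word, n):
    # Same generator as above, repeated so this snippet runs on its own.
    for start in range(0, len(word) - n, n - 1):
        yield word[start:start + n]
    yield word[-n:]

def checked_special_ngrams(word, n):
    """Hypothetical wrapper: return a list, or [] when the word is shorter than n."""
    if n < 2:
        raise ValueError("n must be at least 2 so that the range step n - 1 is positive")
    if len(word) < n:
        return []
    return list(special_ngrams(word, n))

print(checked_special_ngrams("adventure", 4))      # ['adve', 'entu', 'ture']
print(checked_special_ngrams("tyrannosaurus", 4))  # ['tyra', 'anno', 'osau', 'urus']
print(checked_special_ngrams("advent", 4))         # ['adve', 'vent']
print(checked_special_ngrams("abc", 4))            # []
```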
        