I'm working with large files, in this case a file with one word per line and over 300k lines. I'm trying to find a way to obtain the most common patterns present in the words of the file. For example, treating it as a list (small example):
a = ["122", "pass123", "dav1", "1355122"]
it should recognize that "122" is commonly used.
It is important to do this efficiently, because with so many words to check the processing time would otherwise be too long.
I have tried the following, which I saw in the post Python finding most common pattern in list of strings, but in my case it only returns the most common characters in the file:
matches = Counter(reduce(lambda x, y: x + y, map(lambda x: x, list_of_words))).most_common()
where list_of_words is a list containing all the words in the file.
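To show what I mean on the small example above (rewriting that line with the + that got lost in formatting and the no-op map removed), this only ever counts single characters:

from collections import Counter
from functools import reduce

list_of_words = ["122", "pass123", "dav1", "1355122"]
# reduce() concatenates all the words into one long string, so Counter
# ends up counting individual characters rather than substrings
matches = Counter(reduce(lambda x, y: x + y, list_of_words)).most_common()
print(matches)  # [('1', 5), ('2', 5), ('a', 2), ...]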
Is there any way to obtain string matches of at least 3 characters instead of only getting single characters?
Thank you all for your help :)
CodePudding user response:
I tried this out:
from collections import Counter

def catalogue_patterns(word, min_len, max_len):
    # count substrings of `word` with length between min_len and max_len
    n_chars = len(word)
    patterns = Counter()
    for start in range(n_chars - min_len):
        for end in range(
                start + min_len, min(start + max_len + 1, n_chars)):
            seq = word[start:end]
            patterns[seq] += 1
    return patterns
which for catalogue_patterns('abcabcd', 3, 5) returns:
Counter({'abc': 2,
'abca': 1,
'abcab': 1,
'bca': 1,
'bcab': 1,
'bcabc': 1,
'cab': 1,
'cabc': 1})
Then
def catalogue_corpus(corpus, min_len, max_len):
    # merge the per-word pattern counts over the whole corpus
    patterns = Counter()
    for word in corpus:
        patterns += catalogue_patterns(word, min_len, max_len)
    return patterns
patterns = catalogue_corpus(corpus, 3, 5)
print(patterns.most_common())
(where corpus would be your list of words). I ran it on a list of 100,000 artificially generated words and it took about 19s. In a real corpus, where certain words are repeated frequently, you can memoize the function for additional speed. You can do this easily in Python using lru_cache:
from functools import lru_cache

@lru_cache()
def catalogue_patterns_memoized(word, min_len, max_len):
    n_chars = len(word)
    patterns = Counter()
    for start in range(n_chars - min_len):
        for end in range(
                start + min_len, min(start + max_len + 1, n_chars)):
            seq = word[start:end]
            patterns[seq] += 1
    return patterns
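To actually run the memoized version over the whole file you would still merge the per-word Counters; here is a minimal sketch (the name catalogue_corpus_memoized is just illustrative, not from above):

def catalogue_corpus_memoized(corpus, min_len, max_len):
    patterns = Counter()
    for word in corpus:
        # repeated words hit the lru_cache instead of being re-scanned
        patterns += catalogue_patterns_memoized(word, min_len, max_len)
    return patterns

patterns = catalogue_corpus_memoized(corpus, 3, 5)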
If speed is really an issue, though, you can get much better performance by doing this in C (or Cython) instead.
As a side note:
Counter(
    reduce(
        lambda x, y: x + y,
        map(lambda word: catalogue_patterns(word, 3, 5), corpus)))
took about 8x as long.
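If you want to avoid building and merging many small Counters altogether, another variation (my own, not benchmarked here) is to stream every substring into a single Counter via a generator:

def iter_patterns(corpus, min_len, max_len):
    # yield substrings lazily so only one Counter is ever built
    for word in corpus:
        n_chars = len(word)
        for start in range(n_chars - min_len):
            for end in range(start + min_len, min(start + max_len + 1, n_chars)):
                yield word[start:end]

patterns = Counter(iter_patterns(corpus, 3, 5))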
My artificial test corpus was generated using:
import numpy as np

def generate_random_words(n, mean_len=5):
    # skewed letter probabilities so some letters (and hence patterns) recur more often
    probs = np.array([10, 2, 6, 4, 20, 3, 4])
    probs = probs / probs.sum()
    return [
        ''.join(
            np.random.choice(
                list('abcdefg'),
                size=np.random.poisson(mean_len),
                p=probs))
        for _ in range(n)]
corpus = generate_random_words(100_000, 5)
print(corpus[:10])
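If you want to check the rough 19s figure on your own machine (it will vary with hardware), a simple timing wrapper is enough:

import time

t0 = time.perf_counter()
patterns = catalogue_corpus(corpus, 3, 5)
print(f"{time.perf_counter() - t0:.1f}s for {len(patterns)} distinct patterns")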
CodePudding user response:
Try the nltk.probability.FreqDist class to find the number of times each token occurs:
import nltk

n = 3
with open('your.txt') as f:
    tokens = nltk.tokenize.word_tokenize(f.read())

freq_dist = nltk.FreqDist(t.lower() for t in tokens)
most_common = [(w, c) for w, c in freq_dist.most_common() if len(w) == n and c > 1]
print(most_common)
The output for your current file:
[('new', 9), ('san', 5), ('can', 2), ('not', 2), ('gon', 2), ('las', 2), ('los', 2), ('piu', 2), ('usc', 2)]
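Note that this counts whole tokens of exactly n characters. If what you need is common substrings inside the words (as in the question), one possible extension, untested on your file, is to count character n-grams with nltk.ngrams:

from collections import Counter
from nltk import ngrams

n = 3
char_grams = Counter()
for t in tokens:
    # character n-grams of each token, e.g. 'pass123' -> 'pas', 'ass', 'ss1', ...
    char_grams.update(''.join(g) for g in ngrams(t.lower(), n))
print(char_grams.most_common(10))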