I'm working with large files, in this case a file with one word per line and over 300k lines. I'm trying to find a way to obtain the most common patterns present in the words of the file. For example, treating it as a list (small example):
a = ["122", "pass123", "dav1", "1355122"]
it should recognize that "122" is commonly used.
It is important to do this efficiently, because with so many words to check the processing time would otherwise be too long.
I have tried the following, which I saw in the post Python finding most common pattern in list of strings, but in my case it only returns the most common characters in the file:
matches = Counter(reduce(lambda x, y: x + y, map(lambda x: x, list_of_words))).most_common()
where list_of_words is a list containing all the words in the file.
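To show what I mean on the small example above (rewriting that line with the + that got lost in formatting and the no-op map removed), this only ever counts single characters:

from collections import Counter
from functools import reduce

list_of_words = ["122", "pass123", "dav1", "1355122"]
# reduce() concatenates all the words into one long string, so Counter
# ends up counting individual characters rather than substrings
matches = Counter(reduce(lambda x, y: x + y, list_of_words)).most_common()
print(matches)  # [('1', 5), ('2', 5), ('a', 2), ...]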
Is there any way to obtain string matches of at least 3 characters instead of only getting single characters?
Thank you all for your help :)
CodePudding user response:
I tried this out:
from collections import Counter

def catalogue_patterns(word, min_len, max_len):
    # count substrings of `word` with length between min_len and max_len
    n_chars = len(word)
    patterns = Counter()
    for start in range(n_chars - min_len):
        for end in range(
                start + min_len, min(start + max_len + 1, n_chars)):
            seq = word[start:end]
            patterns[seq] += 1
    return patterns
which for catalogue_patterns('abcabcd', 3, 5) returns:
Counter({'abc': 2,
'abca': 1,
'abcab': 1,
'bca': 1,
'bcab': 1,
'bcabc': 1,
'cab': 1,
'cabc': 1})
Then
def catalogue_corpus(corpus, min_len, max_len):
    # merge the per-word pattern counts over the whole corpus
    patterns = Counter()
    for word in corpus:
        patterns += catalogue_patterns(word, min_len, max_len)
    return patterns
patterns = catalogue_corpus(corpus, 3, 5)
print(patterns.most_common())
(where corpus would be your list of words). I ran it on a list of 100,000 artificially generated words and it took about 19s. In a real corpus, where certain words are repeated frequently, you can memoize the function for additional speed. You can do this easily in Python using lru_cache:
from functools import lru_cache

@lru_cache()
def catalogue_patterns_memoized(word, min_len, max_len):
    n_chars = len(word)
    patterns = Counter()
    for start in range(n_chars - min_len):
        for end in range(
                start + min_len, min(start + max_len + 1, n_chars)):
            seq = word[start:end]
            patterns[seq] += 1
    return patterns
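To actually run the memoized version over the whole file you would still merge the per-word Counters; here is a minimal sketch (the name catalogue_corpus_memoized is just illustrative, not from above):

def catalogue_corpus_memoized(corpus, min_len, max_len):
    patterns = Counter()
    for word in corpus:
        # repeated words hit the lru_cache instead of being re-scanned
        patterns += catalogue_patterns_memoized(word, min_len, max_len)
    return patterns

patterns = catalogue_corpus_memoized(corpus, 3, 5)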
If speed is really an issue, though, you can get much better performance by doing this in C (or Cython) instead.
As a side note:
Counter(
    reduce(
        lambda x, y: x + y,
        map(lambda word: catalogue_patterns(word, 3, 5), corpus)))
took about 8x as long.
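If you want to avoid building and merging many small Counters altogether, another variation (my own, not benchmarked here) is to stream every substring into a single Counter via a generator:

def iter_patterns(corpus, min_len, max_len):
    # yield substrings lazily so only one Counter is ever built
    for word in corpus:
        n_chars = len(word)
        for start in range(n_chars - min_len):
            for end in range(start + min_len, min(start + max_len + 1, n_chars)):
                yield word[start:end]

patterns = Counter(iter_patterns(corpus, 3, 5))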
My artificial test corpus was generated using:
import numpy as np

def generate_random_words(n, mean_len=5):
    # skewed letter probabilities so some letters (and hence patterns) recur more often
    probs = np.array([10, 2, 6, 4, 20, 3, 4])
    probs = probs / probs.sum()
    return [
        ''.join(
            np.random.choice(
                list('abcdefg'),
                size=np.random.poisson(mean_len),
                p=probs))
        for _ in range(n)]
corpus = generate_random_words(100_000, 5)
print(corpus[:10])
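If you want to check the rough 19s figure on your own machine (it will vary with hardware), a simple timing wrapper is enough:

import time

t0 = time.perf_counter()
patterns = catalogue_corpus(corpus, 3, 5)
print(f"{time.perf_counter() - t0:.1f}s for {len(patterns)} distinct patterns")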
CodePudding user response:
Try the nltk.probability.FreqDist class to find the number of times each token occurs:
import nltk

n = 3
with open('your.txt') as f:
    tokens = nltk.tokenize.word_tokenize(f.read())

freq_dist = nltk.FreqDist(t.lower() for t in tokens)
most_common = [(w, c) for w, c in freq_dist.most_common() if len(w) == n and c > 1]
print(most_common)
The output for your current file:
[('new', 9), ('san', 5), ('can', 2), ('not', 2), ('gon', 2), ('las', 2), ('los', 2), ('piu', 2), ('usc', 2)]
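Note that this counts whole tokens of exactly n characters. If what you need is common substrings inside the words (as in the question), one possible extension, untested on your file, is to count character n-grams with nltk.ngrams:

from collections import Counter
from nltk import ngrams

n = 3
char_grams = Counter()
for t in tokens:
    # character n-grams of each token, e.g. 'pass123' -> 'pas', 'ass', 'ss1', ...
    char_grams.update(''.join(g) for g in ngrams(t.lower(), n))
print(char_grams.most_common(10))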