Regex - occurrences of a batch of keywords in a text


I'm doing keyword extraction on documents.

The inputs are:

  • thousands of documents (up to 2GB in size)
  • about ~200k keywords aggregated by categories

As of now, for every document, we search every keyword one by one, which I think is inefficient.

So I thought about compiling regexes by category of keywords using pipes:

import re

text = """
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC,
making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia,
looked up one of the more obscure Latin words,
consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature,
discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of
"de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero,
written in 45 BC. This book is a treatise on the theory of ethics,
very popular during the Renaissance. The first line of Lorem Ipsum,
"Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32. 
"""

regexes = [
    r'(?P<Writing__book>book)',
    r'(?P<Writing__word>word)',
    r'(?P<Writing__latin>latin)',
    r'(?P<Writing__text>text)',
    r'(?P<Writing__literature>literature)',
    r'(?P<Cities__virginia>virginia)',
    r'(?P<Genre__classical>classical)',
    r'(?P<Genre__renaissance>renaissance)',
]
compiled_regex = '|'.join(regexes)
results = re.findall(
        compiled_regex,
        text,
        flags=re.MULTILINE | re.IGNORECASE
    )
for result in results:
    print(result)

This prints:

('', '', '', 'text', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', 'literature', '', '', '')
('', '', 'Latin', '', '', '', '', '')
('', '', '', '', '', 'Virginia', '', '')
('', '', 'Latin', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', 'word', '', '', '', '', '', '')
('', '', '', '', '', '', 'classical', '')
('', '', '', '', 'literature', '', '', '')
('book', '', '', '', '', '', '', '')
('', '', '', '', '', '', '', 'Renaissance')

What I'd like to get is a dictionary with each category__keyword and the number of occurrences, like:

{'Writing__book': 1, 'Writing__word': 2, 'Cities__virginia': 1, ...}

CodePudding user response:

Here is a solution you can try,

import re

from collections import defaultdict

text = """..."""

regexes = ["..."]

compiled_regex = '|'.join(regexes)

results = re.finditer(  # <-- Change to finditer, which returns an iterator (efficient on large data)
    compiled_regex,
    text,
    flags=re.MULTILINE | re.IGNORECASE
)

word_counts = defaultdict(int)  # <-- Default dict to track counts

for result in results:
    for key_, value_ in result.groupdict().items():  # <-- Use groupdict(), since you have named capturing groups
        if value_:
            word_counts[key_] += 1

print(word_counts)

Output:

defaultdict(<class 'int'>, {'Writing__text': 1, 'Genre__classical': 2, 'Writing__latin': 3, 'Writing__literature': 2, 'Cities__virginia': 1, 'Writing__word': 2, 'Writing__book': 1, 'Genre__renaissance': 1})
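Since the same keyword set is applied to thousands of documents, it may also be worth compiling the alternation once and attributing each match via `Match.lastgroup` (the name of the alternative that matched) instead of scanning the whole `groupdict()` per match. A sketch building on the snippet above:

```python
import re
from collections import Counter

text = """
Contrary to popular belief, Lorem Ipsum is not simply random text.
It has roots in a piece of classical Latin literature from 45 BC,
making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia,
looked up one of the more obscure Latin words,
consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature,
discovered the undoubtable source.
Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of
"de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero,
written in 45 BC. This book is a treatise on the theory of ethics,
very popular during the Renaissance. The first line of Lorem Ipsum,
"Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.
"""

# Compile the named-group alternation once so it can be reused
# across every document instead of being re-parsed each time
pattern = re.compile(
    '|'.join([
        r'(?P<Writing__book>book)',
        r'(?P<Writing__word>word)',
        r'(?P<Writing__latin>latin)',
        r'(?P<Writing__text>text)',
        r'(?P<Writing__literature>literature)',
        r'(?P<Cities__virginia>virginia)',
        r'(?P<Genre__classical>classical)',
        r'(?P<Genre__renaissance>renaissance)',
    ]),
    flags=re.IGNORECASE,
)

def count_keywords(doc):
    # Match.lastgroup is the name of the group that matched,
    # so each match contributes exactly one counter increment
    return dict(Counter(m.lastgroup for m in pattern.finditer(doc)))

print(count_keywords(text))
```

The `count_keywords` helper name is illustrative; the output is the same plain dict of `Category__keyword` counts requested in the question.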

CodePudding user response:

You could remove all punctuation from your text and lowercase it; then split it to a word list and count words using a Counter. Then use a comprehension over a dictionary of words in each category to build your desired result:

import re
from collections import Counter

words = { 'Writing' : ['word', 'book', 'latin', 'text', 'literature'],
          'Cities' : ['virginia'],
          'Genre' : ['classical', 'renaissance']
        }
counts = Counter(re.split(r'\s*[^a-z0-9]', text.lower()))
result = { f'{k}__{w}' : counts[w] for k, v in words.items() for w in v }

Output:

{
    "Writing__word": 1,
    "Writing__book": 1,
    "Writing__latin": 3,
    "Writing__text": 1,
    "Writing__literature": 2,
    "Cities__virginia": 1,
    "Genre__classical": 2,
    "Genre__renaissance": 1
}

Better yet, produce a dict of dict of counts:

result = { k : { w : counts[w] for w in v } for k, v in words.items() }

Output:

{
    "Writing": {
        "word": 1,
        "book": 1,
        "latin": 3,
        "text": 1,
        "literature": 2,
        "fred": 0
    },
    "Cities": {
        "virginia": 1
    },
    "Genre": {
        "classical": 2,
        "renaissance": 1
    }
}

Performance-wise, for the sample data and search words this method is about 45% faster than the regex-based method (100k iterations with timeit: 4.77s vs 9.12s). I would expect this advantage to grow as the complexity of the search increases.
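A comparison like this can be reproduced with `timeit`; a minimal harness along these lines (names and the small sample text are illustrative, and absolute timings will vary by machine):

```python
import re
import timeit
from collections import Counter

text = "classical Latin literature, a Latin professor, obscure Latin word"
keywords = {'Writing': ['word', 'latin', 'literature'], 'Genre': ['classical']}

# Build the named-group alternation once for the regex variant
pattern = re.compile(
    '|'.join(f'(?P<{k}__{w}>{w})' for k, v in keywords.items() for w in v),
    flags=re.IGNORECASE,
)

def with_regex():
    # One counter increment per match, keyed by the matching group's name
    return dict(Counter(m.lastgroup for m in pattern.finditer(text)))

def with_counter():
    # Split-and-count approach from this answer
    counts = Counter(re.split(r'\s*[^a-z0-9]', text.lower()))
    return {f'{k}__{w}': counts[w] for k, v in keywords.items() for w in v}

print(timeit.timeit(with_regex, number=10_000))
print(timeit.timeit(with_counter, number=10_000))
```

Both variants should agree on the counts for this sample before the timings are compared.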

CodePudding user response:

(untested) But I would do something like:

# Strip punctuation before splitting; extend with .replace() for other characters as needed
cleaned = text.replace(",", "").replace(".", "").replace("(", "").replace('"', "")
input_as_list = cleaned.lower().split()

# Add any desired words here
words_to_find = ["book", "word", "latin"]

# Output dict
output = {}

for word in words_to_find:
    output[word] = input_as_list.count(word)

print(output)

With the sample text this returns:

{'book': 1, 'word': 1, 'latin': 3}

Using Python's built-in string methods over regex can be preferable, as their behavior is clearer.
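One behavioral difference worth noting: exact-token counting with `list.count()` is not the same as the substring matching done by the regex answers above, so for example "words" does not count toward "word" here. A quick illustration:

```python
import re

sample = "one word, two words"
tokens = sample.replace(",", "").lower().split()

# list.count() only counts exact tokens: "words" != "word"
print(tokens.count("word"))               # exact-token count

# the regex approach also matches "word" inside "words"
print(len(re.findall("word", sample)))    # substring count
```

Whether substring or whole-word counting is correct depends on the keywords; the regex answers could restrict to whole words by wrapping each keyword in `\b` word boundaries.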
