Custom pattern to match phrases in spacy's Matcher


I'm trying to use spaCy to match some sample sentences. I tried the sample code successfully, but now I need something more specific. First, the sample code so that you understand better:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

This works very well, but now I need something more specific: I need Python to load phrases from a file (each sentence different from the others), store them in memory, and then check whether a phrase (Hello, world! Hello world! in the example) contains any of the stored patterns. Is this possible? If yes, could someone help or guide me? I really don't know how to proceed. Thank you very much!

CodePudding user response:

If I understand correctly, you want to:

  1. Read an external file that contains, among other things, the string to match, which in your case is Hello, world!
  2. Look for your pattern inside the loaded file.
  3. Return the pattern as you do above.

This should work:

# File contents:
"""./myfile.txt
This is one sentence. Hello world! This is another sentence.
Yet another sentence. Hello world... Hello, world!
"""

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

# Load file as string into memory: https://java2blog.com/python-read-file-into-string/
with open('myfile.txt') as f:
    doc = nlp(f.read())

# Use the pipeline's sentence recognizer: https://spacy.io/usage/linguistic-features#sbd
for sent in doc.sents:
    # Calling the matcher directly on a Span requires spaCy v3+
    matches = matcher(sent)
    # From your code, just replace `doc` by `sent`
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = sent[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)

Note that if your file is very large, you would probably want to read it line by line instead:

with open('myfile.txt') as f:
    for line in f:
        # process each line here, e.g. matches = matcher(nlp(line))