spacy matcher pattern IN REGEX Tag

Time:11-10

My goal is to use spaCy to match sentences that contain one of the following words: ['studium','abschluss','ausbildung']

I can solve the problem with this line:

pattern = [{"LOWER": {'IN':['studium','abschluss', 'ausbildung']}}]

My problem is that German makes heavy use of compound words such as Hochschulstudium, Masterstudium, Studiengang, etc.

How can I use a regex inside the IN clause to match all words containing the word Studium?

CodePudding user response:

You can use the REGEX operator:

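import re
# Words that must match as whole tokens
l = ['abschluss', 'ausbildung']
# Match any item of l, or any run of letters ending in "studium"
pattern = [{'LOWER': {'REGEX':fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]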

Note:

  • map(re.escape, l) - escapes the items in the l list
  • "|".join(...) - joins the words as alternatives (word1|word2|wordN)
  • ^(?:...|[^\W\d_]*studium)$ - a regex that matches
    • ^ - start of string (here, token)
    • (?:...|[^\W\d_]*studium) - a non-capturing group matching either one of the l items, or zero or more letters ([^\W\d_]*) followed by studium
    • $ - end of string (token here).
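To see which tokens the regex accepts, it can be tried outside spaCy with plain re. This is only a sketch: the sample tokens are made up, they are lowercased by hand to mimic the LOWER attribute, and re.search is used on the assumption that the pattern is applied to each token with search semantics (which is why the ^ and $ anchors matter).

```python
import re

# Exact-match words from the answer above
words = ['abschluss', 'ausbildung']
regex = fr'^(?:{"|".join(map(re.escape, words))}|[^\W\d_]*studium)$'

# Hypothetical sample tokens, already lowercased
tokens = ['hochschulstudium', 'masterstudium', 'studium',
          'abschluss', 'studiengang', 'ausbildungen']

# Keep only the tokens the anchored regex accepts
matched = [t for t in tokens if re.search(regex, t)]
print(matched)
# ['hochschulstudium', 'masterstudium', 'studium', 'abschluss']
```

Note that studiengang is not matched, because it contains studien rather than studium, and ausbildungen is rejected by the $ anchor. Covering such forms would need an extra alternative in the group.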