My goal is to use spaCy to match the sentences that contain one of the following words: ['studium','abschluss','ausbildung']
I can solve the problem with this line:
pattern = [{"LOWER": {'IN':['studium','abschluss', 'ausbildung']}}]
My problem is that German makes heavy use of compound words such as Hochschulstudium, Masterstudium, Studiengang etc.
How can I use a regex inside the IN clause to match all words containing the word Studium?
CodePudding user response:
You can use the REGEX operator:
import re
l = ['abschluss', 'ausbildung']
pattern = [{'LOWER': {'REGEX':fr'^(?:{"|".join(map(re.escape, l))}|[^\W\d_]*studium)$'}}]
Note:
map(re.escape, l) - escapes the items in the l list
"|".join(...) - joins the words as alternatives (word1|word2|wordN)
^(?:...|[^\W\d_]*studium)$ - a regex that matches:
    ^ - start of string (here, token)
    (?:...|[^\W\d_]*studium) - a non-capturing group matching any of the l items, or zero or more letters ([^\W\d_]*) followed by studium
    $ - end of string (here, token).
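To see which tokens the expression accepts, it can be checked with the re module alone, without spaCy. This is only a minimal sketch: the sample tokens are made up, the list is named words instead of l, and it assumes the REGEX operator behaves like re.search on the lowercased token text. The second expression, contains_regex, is a hypothetical variant that is not part of the answer above; it shows one way to also accept compounds such as Studiengang, where the stem is not at the end.

```python
import re

# Words that must match as whole tokens (same list as in the answer).
words = ['abschluss', 'ausbildung']

# Same expression as in the spaCy pattern above.
regex = re.compile(fr'^(?:{"|".join(map(re.escape, words))}|[^\W\d_]*studium)$')

# Hypothetical variant (an assumption, not from the answer): letters are
# allowed on both sides of the stem "studi".
contains_regex = re.compile(r'^[^\W\d_]*studi[^\W\d_]*$')

# Made-up sample tokens; they are lowercased to mimic the LOWER attribute.
tokens = ['Hochschulstudium', 'Masterstudium', 'Studium',
          'Abschluss', 'Studiengang', 'Studium2']

matched = [t for t in tokens if regex.search(t.lower())]
matched_variant = [t for t in tokens if contains_regex.search(t.lower())]

print(matched)
print(matched_variant)
```

With these tokens, the original expression keeps Hochschulstudium, Masterstudium, Studium and Abschluss, and rejects Studiengang (it does not end in studium) and Studium2 (it contains a digit). The variant accepts Studiengang but no longer covers abschluss or ausbildung, so those alternatives would still need to be listed if it were used in the pattern.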