Unexpected result using Spacy regex

Time:10-18

I get an unexpected result when matching regular expressions with spaCy (version 3.1.3). I define a simple regex to identify a digit. Then I create strings made of a digit and a letter and try to match them. Everything works as expected except for the letters g, m and t:

Here is a minimal example:

import string 
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
pattern = [{"TEXT": {"REGEX": r"\d"}}]  # raw string so \d is not treated as a string escape
matcher = Matcher(nlp.vocab)
matcher.add("usage",[pattern])

for l in string.ascii_lowercase:
    doc = nlp(f"2{l}")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(l, span.text) 

Result:

a 2a
b 2b
c 2c
d 2d
e 2e
f 2f
g 2    # EXPECTED 2g
h 2h
i 2i
j 2j
k 2k
l 2l
m 2   # EXPECTED 2m
n 2n
o 2o
p 2p
q 2q
r 2r
s 2s
t 2   # EXPECTED 2t
u 2u
v 2v
w 2w
x 2x
y 2y
z 2z

CodePudding user response:

The strings in question are split into two tokens:

2g => ['2', 'g']
2m => ['2', 'm']
2t => ['2', 't']
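The split can be checked directly by printing the tokens. This is a minimal sketch, assuming the same blank English pipeline as in the question; the helper name tokens_of is made up for illustration:

```python
from spacy.lang.en import English

nlp = English()

def tokens_of(text):
    """Return the token texts spaCy produces for `text` (hypothetical helper)."""
    return [token.text for token in nlp(text)]

# Compare a string that stays whole with the three that get split.
for text in ("2a", "2g", "2m", "2t"):
    print(text, "=>", tokens_of(text))
```

For more detail, nlp.tokenizer.explain(text) reports which tokenizer rule produced each token, which helps when tracking down splits like these.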

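To match these strings, the pattern has to account for g, m or t being a separate token that follows the digit. The most likely reason for the split is that spaCy's English tokenizer treats letters that look like a unit of measurement (here g, m and t) as a suffix when they directly follow a digit.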

In that case, you can use

import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
pattern = [{"TEXT": {"REGEX": r"\d"}}, {"TEXT": {"REGEX": r"^[gmt]$"}, "OP": "?"}]
matcher = Matcher(nlp.vocab)
matcher.add("usage",[pattern])

text = "some 1.2t other stuff 1.2a"
doc = nlp(text)
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.text)

Here, the pattern first matches a token that contains a digit and then, optionally (due to "OP": "?"), a token that is exactly g, m or t. Because the second token is optional, the matcher can report both the short and the long match for the same number; spacy.util.filter_spans removes the overlapping spans and keeps only the longest one.
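To see what filter_spans does here, the raw matches can be compared with the filtered ones. This is a sketch that assumes the tokenizer splits "1.2t" into "1.2" and "t" as described above; the variable names are illustrative only:

```python
import spacy
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("usage", [[
    {"TEXT": {"REGEX": r"\d"}},                  # token containing a digit
    {"TEXT": {"REGEX": r"^[gmt]$"}, "OP": "?"},  # optional split-off g, m or t
]])

doc = nlp("some 1.2t other stuff 1.2a")
raw_spans = [doc[start:end] for _, start, end in matcher(doc)]
# filter_spans drops overlapping spans, preferring the longest one.
filtered_spans = spacy.util.filter_spans(raw_spans)

print("raw:     ", [span.text for span in raw_spans])
print("filtered:", [span.text for span in filtered_spans])
```

An alternative worth trying is passing greedy="LONGEST" to matcher.add, which asks the matcher itself to resolve overlapping matches.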

You can make the pattern more precise by requiring the first token to be a complete number. In that case, change "REGEX": "\d" to "REGEX": "^\d+(?:\.\d+)?[a-z]?$" (matches numbers like 5/5a or 55.555/55.555a) or to "REGEX": "^\d*\.?\d+[a-z]?$" (this one also matches strings like .5/.5a), and keep the optional second token as it is. Or, better, use two patterns:

pattern = [
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?[a-z]$"}}],
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?$"}}, {"TEXT": {"REGEX": r"^[gmt]$"}}]
]
matcher = Matcher(nlp.vocab)
matcher.add("usage", pattern)
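A short, self-contained run of the two-pattern variant, assuming the regexes are meant to use + quantifiers (one or more digits); the sample sentence is made up for illustration:

```python
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
patterns = [
    # One token: a number immediately followed by a letter, e.g. "2a".
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?[a-z]$"}}],
    # Two tokens: a number, then a split-off g, m or t, e.g. "2" + "g".
    [{"TEXT": {"REGEX": r"^\d+(?:\.\d+)?$"}}, {"TEXT": {"REGEX": r"^[gmt]$"}}],
]
matcher = Matcher(nlp.vocab)
matcher.add("usage", patterns)

doc = nlp("weights 2g 2a 1.2t 5")
# Sort by start index so the results follow the order of the text.
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

With two separate patterns no optional token is involved, so the matches do not overlap and filter_spans is not needed. A bare number such as 5 is not matched, because neither pattern accepts a number without a trailing letter.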