How to avoid double-extraction of patterns in spaCy?

I'm using an incident database to identify the causes of accidents. I have defined a pattern and a function that extracts the spans matching it. However, the function sometimes returns overlapping results. I saw in a previous post that iterating with for span in spacy.util.filter_spans(spans): avoids such duplicates, but I don't know how to rewrite my function to use it. I would be grateful for any help.

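import spacy
from spacy.matcher import Matcher

# Assumption: the post does not say which pipeline is loaded; any English
# pipeline with a dependency parser should work here.
nlp = spacy.load("en_core_web_sm")

# An optional compound modifier followed by a nominal subject.
pattern111 = [{'DEP': 'compound', 'OP': '?'}, {'DEP': 'nsubj'}]

def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation = []

    matcher.add("matching_111", [pattern111], on_match=None)

    matches = matcher(doc)

    for match_id, start, end in matches:
        matched_span = doc[start:end]
        relation.append(matched_span.text)
    return relation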

Answer:

filter_spans can be used on any list of spans; when spans overlap, it keeps the longest one. The only wrinkle is that you want a list of strings, so collect the Span objects first and convert them to strings only after filtering.

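from spacy.util import filter_spans

def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation = []

    matcher.add("matching_111", [pattern111], on_match=None)

    matches = matcher(doc)

    for match_id, start, end in matches:
        matched_span = doc[start:end]
        # Keep the Span itself; filter_spans needs Span objects, not strings.
        relation.append(matched_span)
    # The only new step: drop overlapping spans (the longest one wins),
    # then convert the survivors to text.
    relation = [ss.text for ss in filter_spans(relation)]
    return relation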
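
To see why this removes the duplicates, here is a rough sketch of the longest-first rule that filter_spans is documented to apply, written with plain (start, end) token offsets so it runs without spaCy. The function name filter_overlaps and the sample offsets are made up for illustration; this is not spaCy's own code, only an approximation of its behavior.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # (start, end) token offsets, end exclusive


def filter_overlaps(spans: List[Offsets]) -> List[Offsets]:
    """Keep the longest spans and drop any span that overlaps a kept one."""
    # Longest first; among spans of equal length, the earlier start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[Offsets] = []
    seen = set()  # token indices already covered by a kept span
    for start, end in ordered:
        # A span is rejected if its first or last token is already covered.
        if start not in seen and end - 1 not in seen:
            kept.append((start, end))
            seen.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# Hypothetical offsets: with an optional compound token, a matcher can
# report both the two-token span (1, 3) and the one-token span (2, 3).
matches = [(1, 3), (2, 3)]
print(filter_overlaps(matches))  # [(1, 3)]
```

The shorter span is dropped because its only token is already covered by the longer one, which is the effect you want for the compound + nsubj pattern.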