How to replace spacy SentenceSegmenter with custom SentenceSegmenter


I am learning NLP and I was trying to replace spaCy's default sentence segmentation with my own custom SentenceSegmenter. While doing so, I see that my custom code is not replacing spaCy's default behaviour.

Note: spaCy == 3.4.1

Below is my code:

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")
@Language.component("component")
def changeSentenceSegmenter(doc):
    for token in doc:
        if token.text=="\n":
            doc[token.i + 1].is_sent_start = True
    return doc
    
nlp.add_pipe('component', before='parser')
nlp.pipe_names

mystring = nlp(u"This is a sentence. This is another.\n\nThis is a\nthird sentence.")

for sent in mystring.sents:
    print(sent)

The output for above code is :

(screenshot of the actual output)

However, my desired output is :

(screenshot of the desired output)

CodePudding user response:

By default, is_sent_start is None. Your component sets it to True for some tokens but does not touch the others. When the parser runs, it assigns a value to every token where it is still unset, and it may create new sentence boundaries that way. That looks like what is happening in this example.

If you want your component to be the only thing that sets sentence boundaries, set is_sent_start to True or False for every token.
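Here is a minimal sketch of that approach. The component name newline_segmenter is my own choice, and the splitting rule is just the one from your component; the point is that every token gets an explicit True or False, so the parser has nothing left to decide and all sentence boundaries come from the component.

import spacy
from spacy.language import Language

@Language.component("newline_segmenter")
def newline_segmenter(doc):
    for token in doc:
        if token.i == 0:
            token.is_sent_start = True   # the first token always opens a sentence
        elif doc[token.i - 1].text == "\n":
            token.is_sent_start = True   # start a new sentence right after a newline
        else:
            token.is_sent_start = False  # explicitly block any other boundary
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("newline_segmenter", before="parser")

doc = nlp("This is a sentence. This is another.\n\nThis is a\nthird sentence.")
for sent in doc.sents:
    print(repr(sent.text))

Note that setting False everywhere else also suppresses the parser's usual splits at full stops, so set True wherever you still want those boundaries. Also, I kept your check for a single "\n" token; spaCy may tokenize consecutive newlines as one "\n\n" token, so you might want to loosen that check.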

Also note that there is one open bug related to this behaviour: it is possible for the parser to overwrite settings when it shouldn't. In practice it rarely comes up, and in particular it shouldn't affect you if you set a value for every token or only set True for some tokens.
