I am learning NLP and I am trying to replace spaCy's default sentence segmenter with my own custom one. However, my custom component does not seem to replace spaCy's default behaviour.
Note: spaCy == 3.4.1
Below is my code:
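import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("component")
def changeSentenceSegmenter(doc):
    # Mark the token after each single-newline token as a sentence start.
    for token in doc:
        # Guard against a trailing "\n" so doc[token.i + 1] stays in range.
        if token.text == "\n" and token.i + 1 < len(doc):
            doc[token.i + 1].is_sent_start = True
    return doc

# Run the custom component before the parser.
nlp.add_pipe("component", before="parser")
print(nlp.pipe_names)

mystring = nlp("This is a sentence. This is another.\n\nThis is a\nthird sentence.")
for sent in mystring.sents:
    print(sent)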
The output of the above code is:
However, my desired output is:
CodePudding user response:
By default, is_sent_start is None for every token. Your component sets it to True for some tokens but leaves it unset for the others. When the parser runs, it assigns a value to any token whose value is still unset, and it may create new sentence boundaries that way. That looks like what is happening in this example.
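To see the unset state for yourself, here is a minimal sketch. It assumes a blank English pipeline (spacy.blank("en")) rather than en_core_web_sm, so there is no parser to fill in values, and the sample text is made up for illustration:

```python
import spacy

# Assumption: a blank pipeline, so nothing sets sentence boundaries for us.
nlp = spacy.blank("en")
doc = nlp("Hello big world")

# Before touching anything, the value is unset (None).
before = doc[1].is_sent_start

# Setting one token to True leaves the other tokens unset.
doc[2].is_sent_start = True
after_other = doc[1].is_sent_start
after_set = doc[2].is_sent_start

print(before, after_other, after_set)
```

The tokens that stay at None are the ones a later parser is free to decide on.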
If you want your component to be the only thing that sets sentence boundaries, set is_sent_start to True or False for every token.
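A rough sketch of what that could look like, keeping your rule that a sentence starts after a lone "\n" token. The component name newline_segmenter is made up, and I am assuming a blank pipeline here so it runs without downloading a model; with en_core_web_sm you would add it with before="parser" as in your code:

```python
import spacy
from spacy.language import Language


@Language.component("newline_segmenter")  # hypothetical component name
def newline_segmenter(doc):
    for token in doc:
        # True for the first token and for any token that follows a lone "\n"
        # token; False for everything else, so no token is left unset.
        follows_newline = token.i > 0 and doc[token.i - 1].text == "\n"
        token.is_sent_start = token.i == 0 or follows_newline
    return doc


# Assumption: blank pipeline (tokenizer only) to keep the example self-contained.
nlp = spacy.blank("en")
nlp.add_pipe("newline_segmenter")

doc = nlp("This is a sentence. This is another.\n\nThis is a\nthird sentence.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(repr(sentence))
```

Note that "\n\n" is tokenized as a single whitespace token, so a check for exactly "\n" does not match it. If you also want blank lines to start a sentence, you would need to loosen the condition, for example by checking whether the previous token's text contains "\n".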
Also note there is one open bug related to this behaviour, so it's possible for the parser to overwrite settings when it shouldn't, though it usually doesn't come up. In particular, if you set a value for every token, or just set True for some tokens, it shouldn't come up.