When I run this code with spaCy, I get the desired result:
import spacy
nlp = spacy.load("en_core_web_sm")
# example 1
test = "All my stuff is at to MyBOQ"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)
The output shows All and my. However, if I add a question mark:
test = "All my stuff is at to MyBOQ?"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)
Now it also identifies MyBOQ as a pronoun. It should be classified as an organization name (word.pos_ == 'ORG') instead.
How do I tell spaCy not to classify MyBOQ as a pronoun? Should I just remove all punctuation before checking for pronouns?
CodePudding user response:
When running your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4), spaCy produces the following results for the text with and without the question mark:
Text                            en_core_web_sm   en_core_web_md   en_core_web_trf
All my stuff is at to MyBOQ?    All, my          my               my
All my stuff is at to MyBOQ     All, my          my               my
In this example, the word "All" is not a pronoun but a determiner, so only the en_core_web_md and en_core_web_trf pipelines produce technically correct results. If you're running an old version of spaCy, I'd suggest updating the package. Alternatively, if spaCy is up to date, try restarting your IDE or computer to see whether the erroneous results go away. There should be no need to remove punctuation before checking for pronouns.
Finally, part-of-speech (PoS) tags do not include organisation names (ORG). I think you're mixing up named-entity labels with PoS tags. "MyBOQ" should be PoS-tagged as a proper noun (PROPN), which the en_core_web_md and en_core_web_trf pipelines identify correctly, whereas the en_core_web_sm pipeline does not (it tags it as a plain NOUN instead).
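To make the PoS-versus-entity distinction concrete, here is a rough sketch that reads the two annotation layers separately: token.pos_ for part-of-speech tags and doc.ents with ent.label_ for named entities. The function names pos_tags, entity_labels and demo are illustrative only, and whether a given pipeline actually labels "MyBOQ" as ORG is not guaranteed, so no output is shown.

```python
def pos_tags(doc):
    """(text, coarse PoS tag) per token; values such as PRON, NOUN, PROPN."""
    return [(token.text, token.pos_) for token in doc]


def entity_labels(doc):
    """(text, entity label) per named-entity span; ORG is found here, not in pos_."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def demo(model="en_core_web_sm", text="All my stuff is at to MyBOQ?"):
    """Load a pipeline (assumed to be installed) and print both layers."""
    # Imported lazily so the helpers above work on any Doc-like object.
    import spacy

    doc = spacy.load(model)(text)
    print("PoS tags:", pos_tags(doc))
    print("Entities:", entity_labels(doc))
```

So a check for organisations would look at ent.label_ == "ORG" on the spans in doc.ents, while the pronoun filter keeps using token.pos_ == "PRON".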