Spacy incorrectly identifying pronouns

Time:01-17

When I try this code using Spacy, I get the desired result:

import spacy
nlp = spacy.load("en_core_web_sm")

# example 1
test = "All my stuff is at to MyBOQ"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)  

The output shows All and my. However, if I add a question mark:

test = "All my stuff is at to MyBOQ?"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)

Now it also identifies MyBOQ as a pronoun. It should be classified as an organization name (word.pos_ == 'ORG') instead.

How do I tell Spacy not to classify MyBOQ as a pronoun? Should I just remove all punctuation before checking for pronouns?

CodePudding user response:

When running your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4), spaCy produces the following results for the text with and without the question mark:

                               en_core_web_sm   en_core_web_md   en_core_web_trf
All my stuff is at to MyBOQ?   All, my          my               my
All my stuff is at to MyBOQ    All, my          my               my

In this example, "All" is a determiner rather than a pronoun, so only the en_core_web_md and en_core_web_trf pipelines produce technically correct results. Note that none of the three pipelines tagged "MyBOQ" as a pronoun, with or without the question mark. If you're running an older version of spaCy, I'd suggest updating the package. If spaCy is already up to date, try restarting your IDE or computer to see whether the erroneous result goes away. There should be no need to remove punctuation before checking for pronouns.
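To see whether the difference comes from the installed versions, it can help to print the spaCy and model versions next to the tags of every token, not only the pronouns. The following is a minimal sketch: the helper names tag_table and show_diagnostics are made up for illustration, and the tags you see will depend on the model you have installed.

```python
def tag_table(doc):
    """Return (text, coarse POS, fine-grained tag) for every token in doc."""
    return [(token.text, token.pos_, token.tag_) for token in doc]


def show_diagnostics(text, model="en_core_web_sm"):
    """Print the spaCy and model versions, then the tags of each token."""
    import spacy  # imported here so tag_table can be used without spaCy

    nlp = spacy.load(model)
    print("spaCy:", spacy.__version__)
    print("model:", nlp.meta["name"], nlp.meta["version"])
    for token_text, pos, tag in tag_table(nlp(text)):
        print(f"{token_text:10} {pos:6} {tag}")
```

Calling show_diagnostics("All my stuff is at to MyBOQ?") and then the same text without the question mark should show whether the two inputs are really tagged differently on your setup.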

Finally, part-of-speech (PoS) tags do not include organisation names (ORG). I think you're mixing up named entity labels with PoS tags: ORG is an entity label, found on the spans in doc.ents (as ent.label_), and is never a value of word.pos_. "MyBOQ" should be PoS-tagged as a proper noun (PROPN), which the en_core_web_md and en_core_web_trf pipelines do correctly, whereas the en_core_web_sm pipeline tags it as a plain NOUN.
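The distinction between the two kinds of labels can be sketched as follows. The helper names tokens_with_pos, entities_with_label and compare_pos_and_entities are hypothetical, and whether any pipeline actually recognises "MyBOQ" as an ORG entity is not something I have checked, so treat the ORG line as an experiment rather than an expected result.

```python
def tokens_with_pos(doc, pos):
    """Return the text of every token whose coarse PoS tag equals pos."""
    return [token.text for token in doc if token.pos_ == pos]


def entities_with_label(doc, label):
    """Return the text of every named entity span carrying the given label."""
    return [ent.text for ent in doc.ents if ent.label_ == label]


def compare_pos_and_entities(text, model="en_core_web_sm"):
    """Print PoS-based and entity-based lookups for the same text."""
    import spacy  # imported here so the helpers above work without spaCy

    doc = spacy.load(model)(text)
    print("PRON tokens: ", tokens_with_pos(doc, "PRON"))
    print("PROPN tokens:", tokens_with_pos(doc, "PROPN"))
    print("ORG entities:", entities_with_label(doc, "ORG"))
```

PoS tags live on individual tokens, while entity labels live on spans, which is why checking word.pos_ == 'ORG' can never match.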
