spaCy (Python 3.10) token.lefts method erroneously returns an empty list


Fellow NLP programmers,

for some time now I have been encountering issues with the spaCy Token.lefts and Token.rights methods. When incorporated in my code, they tend to (quite randomly) return empty lists. To illustrate the problem at hand, I am pasting here a rather simple Python script for extracting some financial information from the provided text (this code was written for illustrative and testing purposes only).

import spacy
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')

def number_extractor(doc):
    # Finds a number in the provided text (if any) and extracts it together with the head_rights.
    # Should return: phrase, last_element.
    phrase = ''
    for token in doc:
        if token.pos_ == 'NUM':
            while True:
                phrase += token.text
                token = token.head
                if token not in list(token.head.lefts):
                    phrase += ' ' + token.text + '.'
                    return phrase, token
    return None, None
                    
def utility_builder(doc, phrase, token):
    # Iterates over head_lefts starting from the head of the last_element.
    # Stops at the ROOT.
    # Should return: phrase, last_element.
    while True:
        token = doc[token.i].head
        phrase = token.text + ' ' + phrase
        if token.pos_ == 'VERB':
            return phrase, token

def nsubj_finder(doc, phrase, token):
    # Iterates over head_lefts starting from the head of the last_element.
    # Searches for a nsubj; when found, adds [nsubj.lefts + nsubj] to the phrase.
    # Should return: phrase.
    token = doc[token.i]
    for token in token.lefts:
        if token.dep_ == "nsubj":
            phrase = ' '.join([t.text for t in token.lefts]) + ' ' + token.text + ' ' + phrase
            return phrase

def document_searcher(doc):
    sentences = []
    for sent in doc.sents:
        phrase, last_element = number_extractor(sent)
        if phrase is not None:
            phrase, last_element = utility_builder(doc, phrase, last_element)
            phrase = nsubj_finder(doc, phrase, last_element)
            sentences.append(phrase)
    return sentences

doc = nlp('''The company, whose profits reached a record high this year, largely attributed
to changes in management, earned a total revenue of $4.26 million.''')
p = document_searcher(doc)
print(p)

The issue here is that the for token in token.lefts iteration in nsubj_finder() fails, because token.lefts returns an empty list. For comparison, I have tried this method in the Python IDLE: sometimes it returns an empty list, sometimes a non-empty one. Do you have any idea what may cause such behavior?
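
For what it is worth, one quick sanity check (assuming spaCy 3.x) is Doc.has_annotation, which reports whether a dependency parse was produced at all; using the doc from the script above:

# If this prints False, the doc has no dependency parse,
# so every token.lefts / token.rights will be empty.
print(doc.has_annotation('DEP'))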

CodePudding user response:

for i in doc:
  print(list(i.lefts))    

With spaCy 3.1.2 this returns the output below, so you need to try another model such as en_core_web_lg, or perhaps another version, as these models can sometimes fail and give strange results:

[]
[The]
[]
[]
[whose]
[profits]
[]
[]
[a, record]
[]
[this]
[]
[]
[company, largely]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[a, total]
[]
[]
[]
[$, 4.26]
[]

And:

for i in doc:
  print(list(i.rights))

Returns:

[]
[,, reached]
[]
[]
[]
[high, year, ,]
[]
[]
[]
[]
[]
[]
[]
[
, to, ,, earned, .]
[]
[changes]
[in]
[management]
[]
[]
[revenue]
[]
[]
[of]
[million]
[]
[]
[]
[]

CodePudding user response:

Ok, so thanks to @Cardstdani I have figured it out. Both the token.lefts and token.rights methods rely on the parser. As far as I recall (please note that you may want to double-check the documentation to confirm this), at least en_core_web_lg should possess a parser - but even with that model I encountered the same problem.
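
A quick way to verify that the loaded pipeline actually contains a parser is to inspect nlp.pipe_names (a minimal sketch):

import spacy

nlp = spacy.load('en_core_web_lg')
# The exact component list varies by model; look for 'parser'.
print(nlp.pipe_names)
print('parser' in nlp.pipe_names)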

To resolve this issue I had to install en_core_web_trf - which is the recommended package if a more accurate model is needed (albeit please note that it is much heavier in size, so you may want to resolve this issue differently if your primary goal is, for example, deploying a lightweight application).
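
For reference, the model can be fetched from the shell or programmatically; spacy.cli.download wraps the same download command:

import spacy
from spacy.cli import download

# Shell equivalent: python -m spacy download en_core_web_trf
download('en_core_web_trf')
nlp = spacy.load('en_core_web_trf')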

In order to install en_core_web_trf I had to downgrade to Python 3.9 (I was using the newly released 3.10) - it may not be the case in your environment, but in mine Python 3.10 was creating some package dependency issues (it could be because I had previously installed en_core_web_lg and en_core_web_sm - however, it should not be).
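
If you run into similar dependency trouble, spaCy's validate command lists the installed models and flags any that are incompatible with the installed spaCy version:

from spacy.cli import validate

# Shell equivalent: python -m spacy validate
validate()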
