spaCy: How to apply rule-based matcher on a dataframe?

I have the following dataframe:

import pandas as pd

details = {
    'Text_id' : [23, 21, 22, 21],
    'Text' : ['All roads lead to Rome', 
              'All work and no play makes Jack a dull buy', 
              'Any port in a storm', 
              'Avoid a questioner, for he is also a tattler'],
}
  
# creating a Dataframe object 
example_df = pd.DataFrame(details)

I want to apply spaCy's rule-based Matcher to the text column of the dataframe to create a new column containing the matches. Let's assume the matches will only be verbs. I define a function that takes the dataframe, the column name, and the pattern as follows:

# import spaCy and the matcher
import spacy
from spacy.matcher import Matcher

# load the pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# define rule-based matching function
def rb_match(df_name, col_name, pattern):

    # initialize the matcher with the shared vocab
    matcher = Matcher(nlp.vocab)
    # add the pattern to the matcher using .add method
    pattern_name = "PATTERN_%s" %col_name  
    matcher.add(pattern_name, [pattern])
    
    # process some text and store it in new column
    # use nlp.pipe for better performance 
    df_name['Text_spacy'] = [d for d in nlp.pipe(df_name[col_name])]
    
    # call the matcher on the doc, the result is a list of tuples
    df_name['matches_tuples'] = df_name['Text_spacy'].apply(lambda x: matcher(x))
    
    # generate matches and store them in a new column
    df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
    
    return df_name

Let's apply the function to the "Text" column of the example dataframe to extract verbs:

rb_match(example_df, "Text", [{"POS":"VERB"}] )

I get the following error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_33/185760541.py in <module>
----> 1 rb_match(example_df, "Text", [{"POS":"VERB"}] )

/tmp/ipykernel_33/66914527.py in rb_match(df_name, col_name, pattern)
     13 
     14     # generate matches
---> 15     df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
     16 
     17     return df_name

/tmp/ipykernel_33/66914527.py in <listcomp>(.0)
     13 
     14     # generate matches
---> 15     df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']]
     16 
     17     return df_name

ValueError: not enough values to unpack (expected 3, got 1)

If we comment out the line df_name["matches"] = [doc[start:end].text for match_id, start, end in df_name['matches_tuples']] in the function and rerun it, we get this output:

 Text_id                                          Text                                                Text_spacy                  matches_tuples
0       23                        All roads lead to Rome                              (All, roads, lead, to, Rome)  [(12643752728212218961, 2, 3)]
1       21    All work and no play makes Jack a dull buy     (All, work, and, no, play, makes, Jack, a, dull, buy)  [(12643752728212218961, 5, 6)]
2       22                           Any port in a storm                                 (Any, port, in, a, storm)                              []
3       21  Avoid a questioner, for he is also a tattler  (Avoid, a, questioner, ,, for, he, is, also, a, tattler)  [(12643752728212218961, 0, 1)]

Basically, calling the matcher returns a list of tuples for each row, where each tuple has the form (match_id, start index of matched span, end index of matched span). However, my function fails when it tries to iterate over these matches.

My question: How can I fix my function so that it returns a new column with the matches? Am I on the right track if I want to apply it to a large dataframe, or is there a more efficient method?

Thank you in advance!

CodePudding user response：

A Doc in spaCy represents a single body of text. Since you have a column of texts, you have multiple docs, and each one has to be matched individually: iterate over the cells of the column and extract the matches from each doc. Note that a row may contain several matches, so you need some data manipulation to make it all work. If you would rather get the results back as a tidy DataFrame, have a look at dframcy: https://spacy.io/universe/project/dframcy. I hope it helps!
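
As a rough illustration of the "multiple matches per row" point, here is a small sketch that reshapes a column of match lists into one match per row with pandas' explode. The values in the matches column are made up for the example and are not real matcher output.

```python
import pandas as pd

# Hypothetical data: each row holds a list of matched strings (possibly empty).
df = pd.DataFrame({
    "Text_id": [1, 2, 3],
    "matches": [["lead"], ["work", "makes"], []],
})

# explode() gives every list element its own row; an empty list becomes NaN,
# so dropna() removes the rows that had no match at all.
long_df = (
    df.explode("matches")
      .dropna(subset=["matches"])
      .reset_index(drop=True)
)

print(long_df)
```

Whether you keep the list-per-row layout or the exploded layout depends on what you do next: the list form keeps one row per text, while the exploded form is easier to group and count.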

CodePudding user response：


                   matches_tuples
0  [(12643752728212218961, 2, 3)]
1  [(12643752728212218961, 5, 6)]
2                              []
3  [(12643752728212218961, 0, 1)]

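for match_id, start, end in df_name['matches_tuples'] iterates over the rows of the column, so each item is a whole list of match tuples, yet the loop tries to unpack that item into three names. Row 0 holds a list with a single tuple, which is why the error says "expected 3, got 1"; the empty list at index 2 would fail in the same way. In addition, doc is not defined inside that comprehension.

To fix it, loop over each row's list rather than over the column: for every row, unpack (match_id, start, end) from each tuple in that row's list and slice that row's own doc with doc[start:end]. Rows without a match then produce an empty list instead of an error.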
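
One way to restructure the function, as a minimal sketch: run the matcher on each doc as it comes out of nlp.pipe and unpack the tuples of that doc's own match list, so a row without matches simply gets an empty list. Passing nlp in as a fourth argument is my own change for the sketch (the original function used a global nlp), and the intermediate Text_spacy and matches_tuples columns are dropped here.

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher


def rb_match(df, col_name, pattern, nlp):
    """Add a 'matches' column holding the list of matched span texts per row."""
    # initialize the matcher with the pipeline's vocab and register the pattern
    matcher = Matcher(nlp.vocab)
    matcher.add("PATTERN_%s" % col_name, [pattern])

    matches_per_row = []
    # nlp.pipe processes the texts as a stream instead of one call per text
    for doc in nlp.pipe(df[col_name].tolist()):
        # matcher(doc) is a (possibly empty) list of (match_id, start, end)
        row_matches = [doc[start:end].text for _, start, end in matcher(doc)]
        matches_per_row.append(row_matches)

    df["matches"] = matches_per_row
    return df
```

With nlp = spacy.load("en_core_web_sm"), the call from the question becomes rb_match(example_df, "Text", [{"POS": "VERB"}], nlp). Using nlp.pipe is already a sensible choice for a large dataframe; not storing the Doc objects in a column also keeps memory use lower.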