Let's assume I would like to score text with a dictionary called dictionary:
import pandas as pd
text = "I would like to reduce carbon emissions"
dictionary = pd.DataFrame({'text': ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"], 'score': [1, -1, -1, -1, 1]})
I would like to write a function that adds up the score of every term in dictionary that appears in text. However, the rule needs one nuance: ngrams take priority over unigrams.
Concretely, if I sum up the unigrams in dictionary that are in text, I get 1 + (-1) + (-1) + (-1) = -2, since like = 1, reduce = -1, carbon = -1 and emissions = -1. This is not what I want. The function must do the following:
- consider ngrams first (reduce carbon emissions in the example): if the set of matching ngrams is not empty, attribute the corresponding value to each of them; otherwise, if the set of ngrams is empty, consider unigrams;
- if the set of ngrams is non-empty, ignore the single words (unigrams) that are part of the selected ngrams (e.g. ignore "reduce", "carbon" and "emissions", which are already in "reduce carbon emissions").
Such a function should give me this output: 2, since like = 1 and reduce carbon emissions = 1.
I am pretty new to Python and I am stuck. Can anyone help me with this?
Thanks!
CodePudding user response:
I would sort the keywords by length in descending order, which guarantees that re tries ngrams before unigrams:
import re
pat = '|'.join(sorted(dictionary.text, key=len, reverse=True))
found = re.findall(fr'\b({pat})\b', text)
Output:
['like', 'reduce carbon emissions']
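To see why the ordering matters, here is a small sketch that compares the pattern built in the original order with the one sorted longest-first. It uses a plain list instead of the DataFrame column, and the names terms, unsorted_found and sorted_found are only illustrative. I also added re.escape, which is an extra precaution in case a term contains regex metacharacters:

```python
import re

text = "I would like to reduce carbon emissions"
terms = ["like", "reduce", "carbon", "emissions", "reduce carbon emissions"]

# Original order: "reduce" is tried before the longer phrase, so it wins.
unsorted_pat = "|".join(map(re.escape, terms))
unsorted_found = re.findall(rf"\b(?:{unsorted_pat})\b", text)

# Longest first: the phrase is tried before its component words.
sorted_pat = "|".join(map(re.escape, sorted(terms, key=len, reverse=True)))
sorted_found = re.findall(rf"\b(?:{sorted_pat})\b", text)

print(unsorted_found)  # ['like', 'reduce', 'carbon', 'emissions']
print(sorted_found)    # ['like', 'reduce carbon emissions']
```

Regex alternation takes the first alternative that matches at a given position, and findall returns non-overlapping matches, so once the phrase has matched, the words inside it are not matched again.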
To get the expected output:
scores = dictionary.set_index('text')['score']
scores.reindex(found).sum()
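If you want this as a single function, something along these lines should work. It is only a sketch: score_text is a name I made up, it takes a plain dict of term to score rather than the DataFrame, it is case-sensitive, and it counts a term once per occurrence in the text, which may or may not be what you want:

```python
import re

def score_text(text, scores):
    """Sum the scores of the terms found in text, matching longest terms first.

    scores is assumed to be a plain mapping of term -> score (hypothetical helper).
    """
    if not scores:
        return 0
    # Longest terms first, so a phrase is matched before its component words.
    terms = sorted(scores, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b")
    # findall yields non-overlapping matches, so words inside a matched phrase are skipped.
    return sum(scores[match] for match in pattern.findall(text))

scores = {"like": 1, "reduce": -1, "carbon": -1, "emissions": -1,
          "reduce carbon emissions": 1}
total = score_text("I would like to reduce carbon emissions", scores)
print(total)  # 2
```

With the DataFrame from the question, the mapping can be built with dict(zip(dictionary['text'], dictionary['score'])).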