I need to count how many times the strings from two groups appear in a sentence. However, whenever a negation precedes a string from group A, I want that occurrence to be counted towards the second group (d below) instead.
I wrote some code that works for a single case. Let me start by showing the dataframe and the groups of strings:
import re
import pandas as pd

# Dataframe
df = pd.DataFrame({'X': ['Ciao, I would like to count the number of occurrences in this text considering negations that can change the meaning of the sentence',
                         "Hello, not number of negations, in this case we need to take care of the negation.",
                         "Hello world, don't number is another case in which where we need to consider negations."]})
# Groups of words to look for in the text
a = pd.DataFrame(['number','ciao','text','care'], columns = ['A'])
d = pd.DataFrame(['need'], columns = ['D'])
This is the code that does the job:
res0 = []
res1 = []
for i in range(len(df)):
    if df['X'][i].find('not') < df['X'][i].find('number') and df['X'][i].find('not') > 0 and abs(df['X'][i].find('not') - df['X'][i].find('number')) < 15:
        # 'not' closely precedes 'number': count 'number' with group d instead of group a
        pattern0 = '|'.join(a[a.A != 'number'].A)
        text = df['X'][i]
        count0 = len(re.findall(pattern0, text))
        res0.append(count0)
        # DataFrame.append was removed in pandas 2.0, so the word list is built directly
        pattern1 = '|'.join(list(d.D) + ['number'])
        count1 = len(re.findall(pattern1, text))
        res1.append(count1)
    else:
        pattern2 = '|'.join(a.A)
        text = df['X'][i]
        count2 = len(re.findall(pattern2, text))
        res0.append(count2)
        pattern3 = '|'.join(d.D)
        count3 = len(re.findall(pattern3, text))
        res1.append(count3)
pd.Series(res0) # [2,1,1]
pd.Series(res1) # [0,2,1]
What is the issue, then? The problem is that I consider only one negation ('not') and only one word in a ('number'). What I would like to do is to extend the code to loop over every negation in neg (see below) and over every element of a. However, when I try to do that, I get wrong results. Find my attempt below:
neg = ['not','dont',"wasnt"]
res0 = []
res1 = []
for i in range(len(df)):
    for j in range(len(neg)):
        for k in range(len(a)):
            if df['X'][i].find(neg[j]) < df['X'][i].find(a.A[k]) and df['X'][i].find(neg[j]) > 0 and abs(df['X'][i].find(neg[j]) - df['X'][i].find(a.A[k])) < 15:
                pattern0 = '|'.join(a[a.A != a.A[k]].A)
                text = df['X'][i]
                count0 = len(re.findall(pattern0, text))
                res0.append(count0)
                # DataFrame.append was removed in pandas 2.0, so the word list is built directly
                pattern1 = '|'.join(list(d.D) + [a.A[k]])
                count1 = len(re.findall(pattern1, text))
                res1.append(count1)
            else:
                pattern2 = '|'.join(a.A)
                text = df['X'][i]
                count2 = len(re.findall(pattern2, text))
                res0.append(count2)
                pattern3 = '|'.join(d.D)
                count3 = len(re.findall(pattern3, text))
                res1.append(count3)
pd.Series(res0) # nonsense
pd.Series(res1) # nonsense
# results should remain a 3x1 vector
What am I doing wrong?
Thanks for your help!
Answer:
You can check for a negation with a positive lookbehind:
pattern = r"(?:(?<=not)|(?<=don't)|(?<=wasn't))\s+(?:number|other|words)"
df['neg_count'] = df['X'].str.findall(pattern).str.len()
print(df)
# Output
                                                   X  neg_count
0  Ciao, I would like to count the number of occu...          0
1  Hello, not number of negations, in this case w...          1
2  Hello world, don't number is another case in w...          1
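
As for what goes wrong in the attempt: the nested loops append to res0 and res1 once per (row, negation, word) combination, so each list ends up with 3 * 3 * 4 = 36 entries instead of 3. In addition, 'dont' can never match the text "don't". The negated-match idea above can be combined with plain counts to get one pair of counts per row. This is only a minimal sketch, assuming a negation counts only when it is immediately followed by the word (as in the pattern above, not the 15-character window from the question). The names group_a, group_d, negations, alternation and count_groups are my own and not from the question.

```python
import re
import pandas as pd

df = pd.DataFrame({'X': [
    'Ciao, I would like to count the number of occurrences in this text considering negations that can change the meaning of the sentence',
    "Hello, not number of negations, in this case we need to take care of the negation.",
    "Hello world, don't number is another case in which where we need to consider negations.",
]})

# Word groups and negations (hypothetical names; values taken from the question and the answer)
group_a = ['number', 'ciao', 'text', 'care']
group_d = ['need']
negations = ['not', "don't", "wasn't"]

def alternation(words):
    """Build a non-capturing regex alternation from a list of literal words."""
    return '(?:' + '|'.join(re.escape(w) for w in words) + ')'

# A group-A word directly preceded by a negation and whitespace.
# The leading \b stops 'not' from matching inside words such as 'another'.
negated_a = re.compile(r'\b' + alternation(negations) + r'\s+' + alternation(group_a))
plain_a = re.compile(alternation(group_a))
plain_d = re.compile(alternation(group_d))

def count_groups(text):
    """Return (count_A, count_D) for one sentence, moving negated A words to D."""
    n_negated = len(negated_a.findall(text))
    n_a = len(plain_a.findall(text)) - n_negated
    n_d = len(plain_d.findall(text)) + n_negated
    return n_a, n_d

# One result row per sentence, so the output keeps the length of df
counts = pd.DataFrame([count_groups(t) for t in df['X']],
                      columns=['count_A', 'count_D'])
print(counts)
```

With the sample data this gives count_A = [2, 1, 0] and count_D = [0, 2, 2]. The third row differs from the [2,1,1] / [0,2,1] in the question because don't is now treated as a negation too.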