Here is the code which I am using:
import pandas as pd
data = [['This is a long sentence which contains a lot of words among them happy', 1],
['This is another sentence which contains the word happy* with special character', 1],
['Content and merry are another words which implies happy', 2],
['Sad is not happy', 2],
['unfortunate has negative connotations', 1]]
df = pd.DataFrame(data, columns=['string', 'id'])
words = {
"positive" : ["happy", "content"],
"negative" : ["sad", "unfortunate"],
"neutral" : ["neutral", "000"]
}
I want to search the dataframe for the words listed under each key of the dictionary, but a key can only be counted once per id.
Simply put:
- Group by id.
- For each group, check whether any sentence in the group contains a positive, a negative or a neutral word (each category counts at most once per id).
- Then sum up those per-group counts across all ids.
For example:
string id
0 This is a long sentence which contains a lot o... 1
1 This is another sentence which contains the wo... 1
2 Content and merry are another words which impl... 2
3 Sad is not happy 2
4 unfortunate has negative connotations 1
The id "1" in row number 0 and 1 both contain the dict values for key positive. Thus positive
can be counted only 1 time for id 1. Also in the last row it contains the word "unfortunate" thus.
For id 1
positive : 1
negative : 1
neutral : 0
After all ids are summed up, the final dataframe should look like this:
word freq
positive 2
negative 2
neutral 0
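For the grouping step I assume the sentences of an id can simply be concatenated into one text; a rough sketch of what I have in mind:
# Just a sketch: one combined text per id (assuming plain concatenation is fine)
grouped_text = df.groupby('id')['string'].agg(' '.join)
print(grouped_text)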
Could you please advise how this can be accomplished in pandas?
CodePudding user response:
Join the sentences of each id into one text, then test each category with any(). This is efficient because any() short-circuits (it stops evaluating at the first word that matches):
# Concatenate all sentences belonging to the same id into one text per id
texts = df.groupby('id')[['string']].agg(' '.join)

# For each category, flag whether any of its words occurs in the group's text
for k, v in words.items():
    texts[k] = texts['string'].transform(
        lambda text: any(word.lower() in text.lower() for word in v)
    )

# Summing the boolean columns counts, per category, the ids that contain a match
result = texts[list(words.keys())].sum(axis=0)
result is a Series:
positive 2
negative 2
neutral 0
dtype: int64
You can convert it to a DataFrame like this:
result_df = result.to_frame().reset_index().set_axis(['word', 'freq'], axis=1)
word freq
0 positive 2
1 negative 2
2 neutral 0
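A more compact variant of the same idea (just a sketch, not required for the answer above) builds one regex alternation per category and lets str.contains do the case-insensitive matching:
import re

# One combined text per id
grouped = df.groupby('id')['string'].agg(' '.join)

# Build one pattern per category, e.g. "happy|content"
patterns = {k: '|'.join(map(re.escape, v)) for k, v in words.items()}

result_df = pd.DataFrame({
    'word': list(patterns),
    'freq': [grouped.str.contains(pat, case=False).sum() for pat in patterns.values()],
})
This produces the same word/freq table as above.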
CodePudding user response:
The following code should do the job, although it does not rely entirely on pandas. Note that I use phrase.lower() so the matching is case-insensitive.
from collections import Counter

# Collect the sentences of each id into a list
out = df.groupby("id")['string'].apply(list)

def get_count(grouped_element):
    # Count each category at most once per id
    counter = Counter({"positive": 0, "negative": 0, "neutral": 0})
    words = {
        "positive": ["happy", "content"],
        "negative": ["sad", "unfortunate"],
        "neutral": ["neutral", "000"]
    }
    for phrase in grouped_element:
        for category, keywords in words.items():
            if counter[category] < 1 and any(word in phrase.lower() for word in keywords):
                counter.update([category])
    return counter

# Sum the per-id counters over all ids
counter = Counter({"positive": 0, "negative": 0, "neutral": 0})
for phrases in out:
    counter.update(get_count(phrases))
print(counter)
output is:
Counter({'positive': 2, 'negative': 2, 'neutral': 0})
to convert to a dataframe:
out = {"word": [], "freq": []}
for key, val in counter.items():
out["word"].append(key)
out["freq"].append(val)
pd.DataFrame(out)
word freq
0 positive 2
1 negative 2
2 neutral 0
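As a side note (just a sketch), the Counter can also be converted to the same DataFrame more directly:
pd.DataFrame(list(counter.items()), columns=['word', 'freq'])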