How to remove words from a sentence that carry no positive or negative sentiment?

I'm trying a sentiment-analysis approach on YouTube comments, but the comments often contain words like mrbeast, tiger/'s, lion/'s, pewdiepie, james, etc., which do not add any feeling to the sentence. I've tried NLTK's averaged_perceptron_tagger, but it didn't work well, as the results below show.

my input:

"mrbeast james lion tigers bad sad clickbait fight nice good"

words that i need in my sentence:

"bad sad clickbait fight nice good"

what I got using averaged_perceptron_tagger:

[('mrbeast', 'NN'),
 ('james', 'NNS'),
 ('lion', 'JJ'),
 ('tigers', 'NNS'),
 ('bad', 'JJ'),
 ('sad', 'JJ'),
 ('clickbait', 'NN'),
 ('fight', 'NN'),
 ('nice', 'RB'),
 ('good', 'JJ')]

So, as you can see, if I remove mrbeast (i.e. everything tagged NN), words like clickbait and fight will also get removed, which ultimately strips the sentiment out of the sentence.

CodePudding user response：

Okay, this is what I do for companies that report on the LSE. You can do something similar with your words.

# define what you consider to be positive, negative or neutral keywords
posKeyWords = ['profit', 'increase', 'pleased', 'excellent', 'good', 'solid financial', 'robust', 'significantly improved', 'improve']
negKeyWords = ['loss', 'decrease', 'disappoint', 'poor', 'bad', 'decline', 'negative', 'weather', 'covid']
neutralKeyWords = ['financial']
# concatenate the three lists into one list of keywords to search for
keyWords = posKeyWords + neutralKeyWords + negKeyWords

Next, get your data as text (from whatever source you choose) and store it in a single string.

# resp is the response object from my web query; use whatever source you are reading from
dataText = resp.text

Mine is a response from a web query, but yours could be from a text file or another source.

Next, create an empty dictionary to hold the keyword counts (dictionary lookups are hashed, so they are fast).

keyWordSummary = {} # dictionary of keywords & values

Finally, loop through the keywords and put their counts into the dict.

import re

# look for some keywords
for kw in keyWords:
    # find every occurrence of the keyword in the text
    kwVal = re.findall(kw, dataText)
    #print('keyword count:', kw, len(kwVal))
    # put the count into the dict
    keyWordSummary[kw] = len(kwVal)
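One thing to watch with the loop above: re.findall(kw, dataText) matches substrings and is case-sensitive, so 'bad' is also counted inside 'badge'. Sometimes that is what you want ('improve' also catches 'improved'), sometimes not. If you need whole words only, a possible variant is sketched below. The helper count_keyword and the sample text are made up for illustration and are not part of my original script.

```python
import re

def count_keyword(keyword, text):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    # re.escape guards against keywords that contain regex metacharacters
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample_text = "The new badge design was bad, really bad."

# plain substring search: also matches the "bad" inside "badge"
substring_count = len(re.findall('bad', sample_text))

# whole-word search: only the two standalone occurrences of "bad"
whole_word_count = count_keyword('bad', sample_text)

print(substring_count, whole_word_count)
```

The trade-off is that whole-word matching no longer picks up inflected forms, so you would have to list 'improved', 'improving' and so on explicitly.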

You now have a dictionary of keyword frequencies, which you could analyse in a dataframe, for example (which is outside the scope of this particular question).
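Applied to the question as asked, the same keyword-list idea can also be used as a filter: keep only the words that appear in your sentiment word list and drop everything else (names, animals, etc.). This is only a minimal sketch, assuming you maintain the list yourself. The names sentiment_words and keep_sentiment_words are illustrative, and the set below is just the words from the example input, not a real lexicon.

```python
# Hypothetical, hand-maintained set of words that carry sentiment.
sentiment_words = {'bad', 'sad', 'clickbait', 'fight', 'nice', 'good'}

def keep_sentiment_words(sentence, lexicon):
    """Return the sentence with every word that is not in lexicon removed."""
    # lower-case each word for the lookup so "Bad" and "bad" are treated the same
    kept = [word for word in sentence.split() if word.lower() in lexicon]
    return ' '.join(kept)

comment = "mrbeast james lion tigers bad sad clickbait fight nice good"
print(keep_sentiment_words(comment, sentiment_words))
# prints: bad sad clickbait fight nice good
```

Unlike the POS-tag approach, this does not depend on how a word is tagged, so clickbait and fight survive even though they are tagged NN. The cost is that any sentiment word missing from the list is silently dropped, so the list needs to cover the vocabulary of your comments.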