I am working on sentiment analysis using NLTK and spaCy. I need to add new words to the negative variables so that a negative polarity value is returned when those words appear in a sentence. I don't know how to do that. Could someone help me, please?
CodePudding user response:
How are you doing the sentiment analysis so far? It would help to see samples to know what exactly you are trying to do. If you are using some kind of trained model that gives you a sentiment value or sentiment class, then it definitely isn't as simple as just telling the model to treat those words as negative; you would have to re-train or fine-tune the model.
Of course, you could mix the model's results with your own post-editing, e.g. by checking whether certain words occur in the text and, if so, rating it even lower than the model did. In general, I am pretty sure that a trained model yields better performance than anything rule-based you could build yourself. If you have data available, the best performance would probably come from fine-tuning a pretrained model, but NLTK and spaCy aren't the best or most user-friendly tools for that.
Edit: Some ways to run toxicity analysis
Models trained to detect toxicity
The most powerful, state-of-the-art way to do this analysis would probably be to use pretrained transformer models that were fine-tuned on what is probably the best annotated dataset available for this topic: the one released for the Jigsaw toxicity detection challenges.
In Python you can find some models for this on huggingface, e.g.:
https://huggingface.co/SkolkovoInstitute/roberta_toxicity_classifier
https://huggingface.co/unitary/toxic-bert
Those pages also include a hosted inference widget/API where you can try out how the models work and what they can detect.
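As a minimal sketch (not part of the linked pages), one of those models can be loaded with the Hugging Face transformers library. This assumes transformers and a backend such as PyTorch are installed; the label names in the output come from the model itself, not from this snippet:

```python
# Minimal sketch: load one of the linked toxicity models with `transformers`.
# Assumes `pip install transformers torch` (or another backend) has been done.
from transformers import pipeline

# "unitary/toxic-bert" is the second model linked above.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

# Returns a list of dicts, each with a label (e.g. "toxic") and a confidence score.
print(toxicity("I will find you and hurt you"))
```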
Purely Rule-Based
Since you have a list of slurs, you are probably expected to use more of a rule-based approach. A basic approach for assigning a toxicity value to a sentence would be:

1. Split the tweet into sentences using NLTK's sent_tokenize().
2. Split each sentence into words using word_tokenize().
3. Set all words to lowercase.
4. Count how many toxic words are in the sentence. The number of toxic word occurrences is the profanity score of that sentence.
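Roughly, that could look like the sketch below. TOXIC_WORDS is a placeholder for your own slur/profanity list, not something provided by NLTK:

```python
# Minimal sketch of the rule-based scoring described above.
from nltk.tokenize import sent_tokenize, word_tokenize
# import nltk; nltk.download("punkt")  # needed once for the tokenizers

TOXIC_WORDS = {"slur1", "slur2"}  # placeholder: replace with your actual list

def profanity_scores(tweet):
    scores = []
    for sentence in sent_tokenize(tweet):
        words = [w.lower() for w in word_tokenize(sentence)]
        # profanity score = number of toxic word occurrences in the sentence
        scores.append(sum(1 for w in words if w in TOXIC_WORDS))
    return scores
```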
Mix Rule-Based and Sentiment Analysis
Since your approach so far seems to be to use a sentiment analysis module, you could try to mix the sentiment score you get from NLTK's sentiment analysis (VADER) module with a rule-based approach that counts the number of words from your list.
You should realize that sentiment analysis is not the same as profanity or toxicity detection, though. If you give something like "I am extremely sad" to NLTK's sentiment analysis, it will return a very negative score even though the sentence contains no profanity or toxicity. On the other hand, if you give it something like "I am so fucking happy", it will at least detect that this is not too negative, which is a benefit compared to a purely rule-based approach, which would flag it as profanity/toxicity. So it makes sense to combine the approaches, but it doesn't make much sense to just insert your list into the sentiment analysis.
What you could do, for example, is weight each score as 50% of the overall score: first calculate the sentiment score, then apply your own rule-based score as described above to lower the result if any of the slurs occur.
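A rough sketch of that combination, assuming VADER for the sentiment part; the way the slur count is scaled and capped here is an arbitrary illustrative choice, not something prescribed by NLTK:

```python
# Minimal sketch of the 50/50 weighting idea described above.
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
# import nltk; nltk.download("vader_lexicon")  # needed once for VADER

TOXIC_WORDS = {"slur1", "slur2"}  # placeholder: replace with your actual list
sia = SentimentIntensityAnalyzer()

def combined_score(sentence):
    # VADER compound score is in [-1, 1]; more negative = more negative sentiment
    sentiment = sia.polarity_scores(sentence)["compound"]
    # rule-based part: map the slur count to [0, -1], capped at 3 occurrences
    hits = sum(1 for w in word_tokenize(sentence) if w.lower() in TOXIC_WORDS)
    rule = -min(hits, 3) / 3
    # 50% sentiment score, 50% rule-based score
    return 0.5 * sentiment + 0.5 * rule
```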