Apologies for what is a very basic question, but I am completely new to Python (I have only used R before as that was what I was taught at university, admittedly not to a very high level) so I am not sure how to do this.
I am performing sentiment analysis on tweets and found a pre-trained sentiment analysis package (RoBERTa) that runs in Python. I have aggregated and cleaned all my data in R, and now have a CSV with a column containing the cleaned tweets.
Here is the code I am using:
! pip install transformers
! pip install scipy
import pandas as pd
import io
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
from google.colab import files
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['example_cleaned_tweets.csv']))
print(df)
tweet = "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."
print(tweet)
# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)
labels = ['Negative', 'Neutral', 'Positive']
encoded_tweet = tokenizer(tweet, return_tensors='pt')
print(encoded_tweet)
# sentiment analysis
output = model(**encoded_tweet)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
for i in range(len(scores)):
    l = labels[i]
    s = scores[i]
    print(l, s)
I took most of it from a guide on how to use the package, but removed the data processing stage.
I have imported the CSV as a dataframe. Can anyone help with how to use the 'cleaned_tweets' column from my dataframe instead of the "tweet" variable, where I have to input the text manually? How would I generate the sentiment scores for each row of the cleaned_tweets column, and then append the negative/neutral/positive scores to the dataframe for each row?
Sorry for the basic question, any help is much appreciated!
CodePudding user response:
Use df.cleaned_tweets
or df["cleaned_tweets"]
Either of these gives you a pandas Series object.
df[["cleaned_tweets"]]
returns a one-column DataFrame instead.
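To make the difference concrete, here is a minimal sketch. The two-row dataframe is a hypothetical stand-in for the CSV from the question; only the column name cleaned_tweets comes from the original post.

```python
import pandas as pd

# Hypothetical stand-in for the CSV loaded in the question.
df = pd.DataFrame(
    {"cleaned_tweets": ["this oatmeal is not good", "quaker oats is the way to go"]}
)

series = df["cleaned_tweets"]    # single brackets: a pandas Series
frame = df[["cleaned_tweets"]]   # double brackets: a one-column DataFrame

print(type(series).__name__)     # Series
print(type(frame).__name__)      # DataFrame

# Iterating over the Series yields one tweet string per row.
for tweet in series:
    print(tweet)
```

For feeding text to a tokenizer, the Series form is usually the convenient one, since it iterates over the strings directly and converts to a plain list with .tolist().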
CodePudding user response:
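If your model exposes a scikit-learn-style predict method, you can pass an entire pandas column for prediction:
df_results = model.predict(df["cleaned_tweets"])
Note that the Hugging Face model loaded in the question (AutoModelForSequenceClassification) has no predict method; it expects tokenized input, so the tokenizer route below is the one that applies there.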
If you use a tokenizer, the docs state that you can pass a list of str:
text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
You just need to convert your pandas column to a list:
list_of_cleaned_tweets = df['cleaned_tweets'].tolist()
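Putting the pieces together, the following is one possible sketch of scoring every row and appending the three scores as new columns. The helper names add_sentiment_columns and make_roberta_logits_fn, the batch size of 32, and the max_length of 512 are assumptions made for illustration; they are not part of the transformers API or of the original guide. The model call is wrapped in a separate function so that the dataframe logic can be tried out without downloading the model.

```python
import numpy as np
import pandas as pd
from scipy.special import softmax

LABELS = ["Negative", "Neutral", "Positive"]  # same order as in the question


def add_sentiment_columns(df, text_column, logits_fn, batch_size=32):
    """Return a copy of df with one probability column per label.

    logits_fn is any callable that maps a list of strings to an array of
    shape (len(texts), 3) holding raw model scores (hypothetical helper).
    """
    texts = df[text_column].fillna("").astype(str).tolist()
    batches = []
    for start in range(0, len(texts), batch_size):
        logits = np.asarray(logits_fn(texts[start:start + batch_size]))
        batches.append(softmax(logits, axis=1))  # row-wise probabilities
    probs = np.vstack(batches) if batches else np.empty((0, len(LABELS)))

    result = df.copy()
    for j, label in enumerate(LABELS):
        result[label] = probs[:, j]
    return result


def make_roberta_logits_fn(model_name="cardiffnlp/twitter-roberta-base-sentiment"):
    """Build a logits_fn backed by the model from the question (downloads it)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def logits_fn(texts):
        # padding=True lets tweets of different lengths share one batch;
        # max_length=512 is an assumed cap for this RoBERTa model.
        encoded = tokenizer(texts, return_tensors="pt", padding=True,
                            truncation=True, max_length=512)
        with torch.no_grad():
            output = model(**encoded)
        return output.logits.numpy()

    return logits_fn
```

In the Colab notebook from the question this would be used as df = add_sentiment_columns(df, "cleaned_tweets", make_roberta_logits_fn()), after which df has Negative, Neutral and Positive columns alongside the tweets. Processing the tweets in batches rather than all at once keeps memory use bounded for a large CSV.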