Connect a dataframe into a function in Python

Apologies for what is a very basic question, but I am completely new to Python (I have only used R before as that was what I was taught at university, admittedly not to a very high level) so I am not sure how to do this.

I am performing sentiment analysis on tweets, and found a pre-trained sentiment analysis package (RoBERTa) which runs on Python - I have aggregated and cleaned all my data in R, and now have a CSV with a column with the cleaned tweets.

Here is the code I am using:

! pip install transformers
! pip install scipy 
import pandas as pd
import io

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

from google.colab import files
uploaded = files.upload()

df = pd.read_csv(io.BytesIO(uploaded['example_cleaned_tweets.csv']))
print(df)

tweet = "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."
print(tweet)

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

encoded_tweet = tokenizer(tweet, return_tensors='pt')
print(encoded_tweet)

# sentiment analysis
output = model(**encoded_tweet)

scores = output[0][0].detach().numpy()
scores = softmax(scores)

# print each label next to its probability
for label, score in zip(labels, scores):
    print(label, score)

I have taken lots of it from a guide on how to use the package I am using, but removed the data processing stage.

I have imported the CSV as a dataframe. Can anyone help me use the 'cleaned_tweets' column from my dataframe in place of the "tweet" variable, where I currently have to type the text in manually? How would I generate the sentiment scores for each row of the cleaned_tweets column, and then append the negative/neutral/positive scores to the dataframe for each row?

Sorry for the basic question, any help is much appreciated!

CodePudding user response：

Use df.cleaned_tweets or df["cleaned_tweets"]; either one gives you a pandas Series object.

df[["cleaned_tweets"]] (double brackets) returns a one-column DataFrame instead.
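
To make the difference concrete, here is a small sketch. The two-row dataframe is a made-up stand-in for the CSV in the question; only the column name cleaned_tweets is taken from it.

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question.
df = pd.DataFrame({"cleaned_tweets": ["great oats", "too mushy"]})

as_series = df["cleaned_tweets"]    # single brackets: a Series
as_frame = df[["cleaned_tweets"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)

# A Series can feed a function one tweet at a time via apply.
lengths = as_series.apply(len)
```

The Series form is usually the one you want when handing each tweet to a function.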

CodePudding user response：

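If your model follows the scikit-learn convention and exposes a predict method, you can pass the whole pandas column for prediction:

df_results = model.predict(df["cleaned_tweets"])

Note that the PyTorch AutoModelForSequenceClassification model loaded in the question does not provide a predict method. It expects tokenized input, so the tweets have to go through the tokenizer first.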

If you use a tokenizer, the docs state that you can pass a list of str:

text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

You just need to convert your pandas column to a list:

list_of_cleaned_tweets = df['cleaned_tweets'].tolist()
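
Putting the pieces together, one possible way to score every row and append the three probabilities is sketched below. The helper names make_roberta_scorer and add_sentiment_columns, the batch size of 32 and the max_length of 512 are my own assumptions, not part of the transformers API or the original guide. The label order is the one used in the question.

```python
import numpy as np
import pandas as pd

LABELS = ["Negative", "Neutral", "Positive"]  # same order as in the question


def make_roberta_scorer(model_name="cardiffnlp/twitter-roberta-base-sentiment"):
    """Hypothetical helper: returns a function mapping a list of str
    to an (n, 3) array of softmax probabilities."""
    # Imported here so the DataFrame helper below works without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score(texts):
        # padding=True lets tweets of different lengths share one tensor;
        # max_length=512 is an assumed cap for a RoBERTa-base model.
        encoded = tokenizer(texts, return_tensors="pt", padding=True,
                            truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**encoded).logits
        return torch.softmax(logits, dim=-1).numpy()

    return score


def add_sentiment_columns(df, score, text_col="cleaned_tweets", batch_size=32):
    """Hypothetical helper: score df[text_col] in batches and return a copy
    of df with one probability column per label."""
    texts = df[text_col].fillna("").astype(str).tolist()
    batches = [score(texts[i:i + batch_size])
               for i in range(0, len(texts), batch_size)]
    probs = np.vstack(batches) if batches else np.empty((0, len(LABELS)))
    out = df.copy()
    for j, label in enumerate(LABELS):
        out[label] = probs[:, j]
    return out
```

With the dataframe from the question this would be called as df = add_sentiment_columns(df, make_roberta_scorer()). Scoring in batches keeps memory use bounded on a large CSV, and missing tweets are treated as empty strings rather than raising an error.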