I am trying to extract sentiment for a very large dataset (more than 606912 instances) in a Jupyter notebook, but it takes several days and gets interrupted. This is my code:
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
full_text = dataset['clean_text'].tolist()
sentiments = []
for e in range(len(full_text)):
    print("Iterate through list:", full_text[e])
    # predict_sentence scores a single string; predict expects a list of sentences
    s = sa.predict_sentence(full_text[e])
    sentiments.append(s)
    print("Iterate through sentiments list:", sentiments[e])
dataset['sentiments'] = sentiments
Can someone help me solve this issue or speed up the operation?
CodePudding user response:
It is not very efficient to process one big source dataset in a single Python instance. My recommendations are:
Version 1 - use your own parallelization
- split the big source dataset into smaller parts
- run the same code in multiple instances (processes), each working on a smaller part of the original dataset, to increase parallelism (see the sketch after this list)
- run this code directly from the command line, not inside the notebook
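A minimal sketch of Version 1, assuming the source data has been split into part_*.csv files that still contain a clean_text column; the script name parallel_sentiment.py and the helper run_chunk are illustrative, not part of any library:

# parallel_sentiment.py - run from the command line, one process per chunk
import sys
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

def run_chunk(in_path, out_path):
    # Each process loads its own model and handles one slice of the data.
    sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
    chunk = pd.read_csv(in_path)
    texts = chunk['clean_text'].astype(str).tolist()
    # predict() takes a whole list and batches internally, which is usually
    # much faster than calling the model once per row.
    chunk['sentiments'] = sa.predict(texts)
    chunk.to_csv(out_path, index=False)

if __name__ == "__main__":
    # e.g. python parallel_sentiment.py part_00.csv part_00_out.csv
    run_chunk(sys.argv[1], sys.argv[2])

The big CSV can be split beforehand (for example with numpy.array_split in pandas, or the Unix split utility), and then one such process is started per part from the command line; the per-part output files are concatenated at the end.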
Version 2 - use an existing solution for parallelization
- install e.g. Apache Spark, Polars, etc. and use their built-in parallel execution (a PySpark sketch follows below)
- see a short performance comparison at https://h2oai.github.io/db-benchmark/
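For Version 2, a minimal PySpark sketch, assuming the data sits in dataset.csv with a clean_text column (Polars or Dask would follow a similar per-partition pattern); the UDF name predict_sentiment and the file paths are illustrative:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
from camel_tools.sentiment import SentimentAnalyzer

_sa = None

def _get_model():
    # Load the model once per worker process and reuse it across batches.
    global _sa
    if _sa is None:
        _sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
    return _sa

@pandas_udf(StringType())
def predict_sentiment(texts: pd.Series) -> pd.Series:
    sa = _get_model()
    return pd.Series(sa.predict(texts.astype(str).tolist()))

spark = SparkSession.builder.appName("sentiment").getOrCreate()
sdf = spark.read.csv("dataset.csv", header=True)
result = sdf.withColumn("sentiments", predict_sentiment(sdf["clean_text"]))
result.write.csv("dataset_with_sentiments", header=True)

Spark then splits the rows across partitions, so the model inference runs in parallel on all available cores (or cluster nodes) instead of one long sequential loop.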