large dataset on Jupyter notebook


I am trying to extract sentiment for a very large dataset (more than 606,912 instances) in a Jupyter notebook, but it takes several days and gets interrupted. This is my code:

import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

full_text = dataset['clean_text'].tolist()
sentiments = []
for e in range(len(full_text)):
    print("Iterate through list:", full_text[e])
    s = sa.predict(full_text[e])
    sentiments.append(s)
    print("Iterate through sentiments list:", sentiments[e])
dataset['sentiments'] = pd.Series(sentiments, index=dataset.index)

Can someone help me solve this issue or speed up the operation?

CodePudding user response:

It is not very efficient to process one big source dataset in a single Python instance. My recommendations are:

Version 1 - use your own parallelization

  • split the big source dataset into smaller parts
  • run the same code in several instances (processes), each working on one of the smaller parts of the original dataset, to increase parallelism
  • run this code directly from the command line rather than from the notebook (a sketch follows this list)
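
A minimal sketch of Version 1, under some assumptions: the cleaned data is stored in a CSV file (dataset.csv is a hypothetical name), worker.py is a new helper script, and the per-text sa.predict call is kept exactly as in the question.

# worker.py - hypothetical helper script, one copy per process
# usage:  python worker.py <part_index> <n_parts>
import sys
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

part, n_parts = int(sys.argv[1]), int(sys.argv[2])

# load the full dataset and keep only this process's slice (every n_parts-th row)
dataset = pd.read_csv('dataset.csv')
chunk = dataset.iloc[part::n_parts].copy()

sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")
# same per-text call as in the question
chunk['sentiments'] = [sa.predict(t) for t in chunk['clean_text']]

# each process writes its own output; concatenate the parts when all are done
chunk.to_csv(f'sentiments_part_{part}.csv', index=False)

Start, for example, python worker.py 0 4, python worker.py 1 4, python worker.py 2 4 and python worker.py 3 4 in parallel from the command line, then merge the four part files with pd.concat.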

Version 2 - use an existing solution for parallelization

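The original answer does not name a concrete library, so the following is only an illustration using Python's standard multiprocessing module; the source file name and the number of processes are assumptions, and each worker loads its own copy of the model in a pool initializer.

import multiprocessing as mp
import pandas as pd
from camel_tools.sentiment import SentimentAnalyzer

_sa = None  # one model instance per worker process

def _init_worker():
    global _sa
    _sa = SentimentAnalyzer("CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment")

def _predict(text):
    return _sa.predict(text)  # same call as in the question

if __name__ == '__main__':
    dataset = pd.read_csv('dataset.csv')  # hypothetical source file
    texts = dataset['clean_text'].tolist()
    with mp.Pool(processes=4, initializer=_init_worker) as pool:
        dataset['sentiments'] = pool.map(_predict, texts, chunksize=64)
    dataset.to_csv('dataset_with_sentiments.csv', index=False)

Because the child processes need to import the worker functions, save this as a script and run it from the command line rather than pasting it into a notebook cell.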