I have a pandas.DataFrame with data from various news articles in the Urdu language, and I'm using the Natural Language Toolkit (NLTK) to process it for my n-gram language model. For this, I first have to tokenise the data, which is stored in a 'News Text' column of the pandas.DataFrame, and then store the tokens in a list that will be used to find the n-grams. However, my dataset is very large (111,862 rows, precisely), and using a 'for' loop to iterate through the pandas.DataFrame is extremely slow, taking well over 30 minutes to go through all the rows in that column and store the results in a list.
tokens = []
for i in range(0, len(dataframe)):
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe["News Text"][i])))
I was wondering if there's a faster way to iterate through all the rows in a specific column in the pandas.DataFrame, and then store the data inside a list. I'd appreciate any assistance regarding this! :)
N-Gram Language Model:
import random
from collections import defaultdict, Counter

import pandas as pd
import nltk
from nltk import FreqDist

dataframe = pd.read_excel("urdu-news-dataset-1M.xlsx")  # Importing the dataset.
dataframe = dataframe.drop(["Index"], axis=1)  # Dropping any unnecessary columns from the dataset.
tokens = []
for i in range(0, len(dataframe)):
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe["News Text"][i])))  # Tokenising all the words in the dataset, and storing them in a list.
unigram = []
bigram = []
trigram = []
quadgram = []
quingram = []
tokenized_text = []
for token in tokens:
    # Lowercasing character by character also splits each word into its
    # individual letters, so the n-grams below are over letters, not whole words.
    token = [x.lower() for x in token]
    token = [word for word in token if word != '.']  # Removing all the punctuation ('.').
    unigram.extend(token)  # Appending all the words to the unigram list.
    tokenized_text.append(token)  # Appending all the tokenised words to another list.
    # Finding all the bigrams, trigrams, quadgrams and quingrams (respectively).
    bigram.extend(list(nltk.ngrams(token, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(nltk.ngrams(token, 3, pad_left=True, pad_right=True)))
    quadgram.extend(list(nltk.ngrams(token, 4, pad_left=True, pad_right=True)))
    quingram.extend(list(nltk.ngrams(token, 5, pad_left=True, pad_right=True)))
# Finding the frequencies of all the bigrams, trigrams, quadgrams and quingrams (respectively).
frequency_bigram = FreqDist(bigram)
frequency_trigram = FreqDist(trigram)
frequency_quadgram = FreqDist(quadgram)
frequency_quingram = FreqDist(quingram)
# Prediction model to find the subsequent word/sentence, given the previous word.
# In this case, each "word" is a single letter of the Urdu alphabet.
bigram_model = defaultdict(Counter)
trigram_model = defaultdict(Counter)
quadgram_model = defaultdict(Counter)
quingram_model = defaultdict(Counter)
for i, j in frequency_bigram:
    if i is not None and j is not None:
        bigram_model[i][j] = frequency_bigram[i, j]
for i, j, k in frequency_trigram:
    if i is not None and j is not None and k is not None:
        trigram_model[(i, j)][k] = frequency_trigram[(i, j, k)]
for i, j, k, l in frequency_quadgram:
    if i is not None and j is not None and k is not None and l is not None:
        quadgram_model[(i, j, k)][l] = frequency_quadgram[(i, j, k, l)]
for i, j, k, l, m in frequency_quingram:
    if i is not None and j is not None and k is not None and l is not None and m is not None:
        quingram_model[(i, j, k, l)][m] = frequency_quingram[(i, j, k, l, m)]
sentence=""
# Function to randomly return a word based on its occurrence in relation to the input word(s), from the dataset.
def predict_word(count):
return random.choice(list(count.elements()))
input_words="ق", "ب" # Input words cannot be more than two at a time.
# Otherwise, the index becomes out of range.
print("".join(input_words))
sentence="".join(input_words)
# Generating an article containing 200 words, using the n-gram language model.
# The input words are the first two words of the article.
# The last line of the output is the complete article.
for i in range(0, 200):
suffix=predict_word(trigram_model[input_words])
sentence=sentence suffix
print(sentence)
input_words=input_words[1], suffix
Example:
tokens = []
for i in range(0, 500):  # Iterates through the first 500 rows of the dataset.
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe["News Text"][i])))
Output:
قب قبل قبلز قبلزم قبلزمی قبلزمیں قبلزمیںگ قبلزمیںگل قبلزمیںگلن قبلزمیںگلنگ قبلزمیںگلنگا قبلزمیںگلنگاو قبلزمیںگلنگاور قبلزمیںگلنگاوری قبلزمیںگلنگاوریو قبلزمیںگلنگاوریون قبلزمیںگلنگاوریونا قبلزمیںگلنگاوریونات قبلزمیںگلنگاوریوناتی قبلزمیںگلنگاوریوناتیا قبلزمیںگلنگاوریوناتیاد قبلزمیںگلنگاوریوناتیادہ قبلزمیںگلنگاوریوناتیادہن قبلزمیںگلنگاوریوناتیادہنا قبلزمیںگلنگاوریوناتیادہنائ
...
قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھی قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھیم قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھیمت قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھیمتی
CodePudding user response:
I believe df.column.map() should be faster.
df['token'] = df['News Text'].map(nltk.tokenize.word_tokenize)
If a str() conversion is needed, I suggest putting it inside the function passed to .map().
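For instance, a minimal sketch of that wrapping (assuming df is the question's dataframe, and that non-string cells should simply be coerced with str()):
import nltk

def tokenize_cell(cell):
    # Coerce non-string cells (e.g. NaN) to str before tokenising.
    return nltk.tokenize.word_tokenize(str(cell))

df['token'] = df['News Text'].map(tokenize_cell)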
CodePudding user response:
Besides the .map method mentioned in the other answer, I'd try one more thing:
tokens = [tok for row in df['token'].tolist() for tok in row]
Basically, the .map() method applies whatever change you want to each cell in that column, .tolist() then converts the column to a plain Python list, and the comprehension flattens the per-row token lists into a single list of tokens (a plain ''.join() would fail here, since each cell holds a list of tokens rather than a string).
I was told that looping through each row in a df was a big NO, so we should try our best to avoid it.
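If you prefer the standard library for the flattening step, itertools.chain does the same thing; a minimal sketch, assuming df['token'] holds one token list per row:
from itertools import chain

tokens = list(chain.from_iterable(df['token']))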
CodePudding user response:
I tried the map() function, as mentioned by Omri, but it resulted in TypeError: expected string or bytes-like object. Instead, I decided to use pandas.DataFrame.iloc to iterate over the specific column in the pandas.DataFrame, which reduced my execution time to under a minute for the entire 111,862 rows, where it was previously taking up to an hour.
Code:
tokens = []
for i in range(0, len(dataframe)):
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe.iloc[i, 0])))
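For what it's worth, that TypeError from .map() usually means some cells aren't strings (e.g. NaN); a sketch of how the map-based approach could still work, assuming that was the cause:
tokens_per_row = dataframe["News Text"].fillna("").astype(str).map(nltk.tokenize.word_tokenize)
tokens = [tok for row in tokens_per_row for tok in row]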