I have a pandas.DataFrame with data from various news articles in the Urdu language, and I'm using the Natural Language Toolkit (NLTK) to process it for my n-gram language model. For this, I first have to tokenise the data, which is stored in a 'News Text' column of the pandas.DataFrame, and then store the tokens in a list that will be used to find the n-grams. However, my dataset is very large (111,862 rows, precisely), and using a 'for' loop to iterate through the pandas.DataFrame is extremely slow, taking well over 30 minutes to go through all the rows in that column and store the results in a list.
tokens = []
for i in range(0, len(dataframe)):
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe["News Text"][i])))
I was wondering if there's a faster way to iterate through all the rows in a specific column in the pandas.DataFrame, and then store the data inside a list. I'd appreciate any assistance regarding this! :)
N-Gram Language Model:
import random
from collections import defaultdict, Counter

import pandas as pd
import nltk
from nltk import FreqDist

dataframe = pd.read_excel("urdu-news-dataset-1M.xlsx")  # Importing the dataset.
dataframe = dataframe.drop(["Index"], axis=1)  # Dropping any unnecessary columns from the dataset.
tokens = []
for i in range(0, len(dataframe)):
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe["News Text"][i])))  # Tokenising all the words in the dataset, and storing them in a list.
unigram = []
bigram = []
trigram = []
quadgram = []
quingram = []
tokenized_text = []
for token in tokens:
    # Lowercasing character by character also splits each word into its
    # individual letters, so the n-grams below are over letters, not whole words.
    token = [x.lower() for x in token]
    token = [word for word in token if word != '.']  # Removing all the punctuation ('.').
    unigram.extend(token)  # Appending all the words to the unigram list.
    tokenized_text.append(token)  # Appending all the tokenised words to another list.
    # Finding all the bigrams, trigrams, quadgrams and quingrams (respectively).
    bigram.extend(list(nltk.ngrams(token, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(nltk.ngrams(token, 3, pad_left=True, pad_right=True)))
    quadgram.extend(list(nltk.ngrams(token, 4, pad_left=True, pad_right=True)))
    quingram.extend(list(nltk.ngrams(token, 5, pad_left=True, pad_right=True)))
# Finding the frequencies of all the bigrams, trigrams, quadgrams and quingrams (respectively).
frequency_bigram = FreqDist(bigram)
frequency_trigram = FreqDist(trigram)
frequency_quadgram = FreqDist(quadgram)
frequency_quingram = FreqDist(quingram)
# Prediction model to find the subsequent word/sentence, given the previous word.
# In this case, each "word" is a single letter of the Urdu alphabet.
bigram_model = defaultdict(Counter)
trigram_model = defaultdict(Counter)
quadgram_model = defaultdict(Counter)
quingram_model = defaultdict(Counter)
for i, j in frequency_bigram:
    if i is not None and j is not None:
        bigram_model[i][j] = frequency_bigram[i, j]
for i, j, k in frequency_trigram:
    if i is not None and j is not None and k is not None:
        trigram_model[(i, j)][k] = frequency_trigram[(i, j, k)]
for i, j, k, l in frequency_quadgram:
    if i is not None and j is not None and k is not None and l is not None:
        quadgram_model[(i, j, k)][l] = frequency_quadgram[(i, j, k, l)]
for i, j, k, l, m in frequency_quingram:
    if i is not None and j is not None and k is not None and l is not None and m is not None:
        quingram_model[(i, j, k, l)][m] = frequency_quingram[(i, j, k, l, m)]
sentence=""
# Function to randomly return a word based on its occurrence in relation to the input word(s), from the dataset.
def predict_word(count):
return random.choice(list(count.elements()))
input_words="ق", "ب" # Input words cannot be more than two at a time.
# Otherwise, the index becomes out of range.
print("".join(input_words))
sentence="".join(input_words)
# Generating an article containing 200 words, using the n-gram language model.
# The input words are the first two words of the article.
# The last line of the output is the complete article.
for i in range(0, 200):
suffix=predict_word(trigram_model[input_words])
sentence=sentence suffix
print(sentence)
input_words=input_words[1], suffix
Example:
tokens = []
for i in range(0, 500):  # Iterates through the first 500 rows of the dataset.
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe["News Text"][i])))
Output:
قب قبل قبلز قبلزم قبلزمی قبلزمیں قبلزمیںگ قبلزمیںگل قبلزمیںگلن قبلزمیںگلنگ قبلزمیںگلنگا قبلزمیںگلنگاو قبلزمیںگلنگاور قبلزمیںگلنگاوری قبلزمیںگلنگاوریو قبلزمیںگلنگاوریون قبلزمیںگلنگاوریونا قبلزمیںگلنگاوریونات قبلزمیںگلنگاوریوناتی قبلزمیںگلنگاوریوناتیا قبلزمیںگلنگاوریوناتیاد قبلزمیںگلنگاوریوناتیادہ قبلزمیںگلنگاوریوناتیادہن قبلزمیںگلنگاوریوناتیادہنا قبلزمیںگلنگاوریوناتیادہنائ
...
قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھی قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھیم قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھیمت قبلزمیںگلنگاوریوناتیادہنائیںفوڈشیخیاںوفاعظمارولیےیہیںکمالیادلہربرعیادیقاتھارتباریشگولترسیڈینئیںانہیںانےکرنے2800563006503610440902020091994707826300فیصداریاانڈزارینونستانٹسزاروبورٹیکھےکروںمزہترسدانبھیمتی
CodePudding user response:
I believe df.column.map() should be faster.
df['token'] = df['News Text'].map(nltk.tokenize.word_tokenize)
If a str() conversion is needed, I suggest putting it inside the function passed to .map().
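For instance, a minimal sketch of that wrapping (assuming df is the question's dataframe, and that non-string cells should simply be coerced with str()):
import nltk

def tokenize_cell(cell):
    # Coerce non-string cells (e.g. NaN) to str before tokenising.
    return nltk.tokenize.word_tokenize(str(cell))

df['token'] = df['News Text'].map(tokenize_cell)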
CodePudding user response:
Besides the .map method mentioned in the other answer, I'd try one more thing:
tokens = [tok for row in df['token'].tolist() for tok in row]
Basically, the .map() method applies whatever change you want to each cell in that column, .tolist() then converts the column to a plain Python list, and the comprehension flattens the per-row token lists into a single list of tokens (a plain ''.join() would fail here, since each cell holds a list of tokens rather than a string).
I was told that looping through each row in a df was a big NO, so we should try our best to avoid it.
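If you prefer the standard library for the flattening step, itertools.chain does the same thing; a minimal sketch, assuming df['token'] holds one token list per row:
from itertools import chain

tokens = list(chain.from_iterable(df['token']))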
CodePudding user response:
I tried the map() function, as mentioned by Omri, but it resulted in TypeError: expected string or bytes-like object. Instead, I decided to use pandas.DataFrame.iloc to iterate over the specific column in the pandas.DataFrame, which reduced my execution time to under a minute for the entire 111,862 rows, where it was previously taking up to an hour.
Code:
tokens = []
for i in range(0, len(dataframe)):
    tokens.extend(nltk.tokenize.word_tokenize(str(dataframe.iloc[i, 0])))
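For what it's worth, that TypeError from .map() usually means some cells aren't strings (e.g. NaN); a sketch of how the map-based approach could still work, assuming that was the cause:
tokens_per_row = dataframe["News Text"].fillna("").astype(str).map(nltk.tokenize.word_tokenize)
tokens = [tok for row in tokens_per_row for tok in row]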