I have a text preprocessing function like this:
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocessing(text):
    text = text.lower()
    text = "".join([char for char in text if char not in string.punctuation])
    words = word_tokenize(text)
    words = [word for word in words if word not in stopwords.words('english')]
    words = [PorterStemmer().stem(word) for word in words]
    return words
I am going to apply this function to a DataFrame column like this:
df['reviewText'] = df['reviewText'].apply(lambda x: preprocessing(x))
But the DataFrame column has around 10,000 review sentences, and the code is taking too long to complete. Is there any way to add a progress bar so that I can get some idea of how long it will take?
PS. If you want to try this on your local machine, the data can be found on this site.
CodePudding user response:
If you want a progress bar, you have to have a loop: a progress bar by definition tracks a loop. Fortunately you have one here in the apply. As a very quick, trivial solution that keeps apply, I would have the function update the progress bar as a side effect:
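from tqdm import tqdm

t = tqdm(total=len(df.index))

def fn(x, state=[0]):
    result = preprocessing(x)
    state[0] = 1  # increment to report: one row processed
    t.update(state[0])
    return result  # apply() stores the returned value in the column

df['reviewText'] = df['reviewText'].apply(fn)
t.close()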
Whether this is clearer than writing the loop out explicitly is your call; I'm not sure it is.
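(What's with the state=[0]? We're defining a mutable default argument, which is allocated once, when fn is defined, and then using it to keep track of state, since we have to manage state manually with this approach. Note that t.update() takes an increment rather than an absolute position, which is why the value stored is always 1.)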
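To make the state=[0] trick above concrete, here is a minimal sketch of how a mutable default argument persists across calls. The function name count_calls is hypothetical and used only for illustration; it is not part of the answer's code.

```python
def count_calls(state=[0]):
    # The default list is created once, when the def statement runs,
    # so every call that omits the argument shares the same list object.
    state[0] += 1
    return state[0]

first = count_calls()
second = count_calls()
third = count_calls()
```

Each call sees the value left behind by the previous one, which is what lets a function passed to apply carry a counter without a global variable. It is also why mutable defaults are usually considered a pitfall when the sharing is not intended.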
Explicit loop
applied = []
for row in tqdm(df["reviewText"]):
    applied.append(preprocessing(row))
df["reviewText"] = applied
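For intuition about what wrapping the iterable in the loop above does, here is a rough standard-library stand-in. The helper with_progress is hypothetical and far simpler than tqdm (no timing or rate estimates); it only shows the idea of reporting progress as items are consumed.

```python
import io
import sys

def with_progress(items, total, out=sys.stderr):
    """Yield each item and write a 'done/total' counter once it has been consumed."""
    for done, item in enumerate(items, start=1):
        yield item
        # Runs when the consumer asks for the next item, i.e. after
        # the current one has been processed.
        out.write(f"\r{done}/{total}")
        out.flush()
    out.write("\n")

# Usage with len() standing in for preprocessing; the counter is written
# to an in-memory buffer here so that it can be inspected.
buf = io.StringIO()
lengths = [len(s) for s in with_progress(["good", "bad!"], total=2, out=buf)]
```

In real use you would leave out at its default so the counter goes to the terminal, and pass total=len(df) for a DataFrame column.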
CodePudding user response:
Import tqdm and replace .apply() with .progress_apply():
from tqdm.auto import tqdm
tqdm.pandas()
df['reviewText'] = df['reviewText'].progress_apply(lambda x: preprocessing(x))