Is there a faster alternative to better-profanity 0.7.0 in python?

Time:09-07

I'm using it in Google Colab to add a separate column to a DataFrame that flags whether the 'Text' column contains a curse word. The DataFrame has more than a million rows, and at about 6 minutes per 1,000 sampled rows this code would take around 5 days to finish. Is there a more efficient alternative? Maybe using deep learning?

import pandas as pd
import language_tool_python
from better_profanity import profanity

tool = language_tool_python.LanguageTool('en-US')
profanity.load_censor_words()

profanity_col = []
for x in df.Text.values:
  matches = tool.check(x)  # grammar check; the result is never used below
  bad_words = profanity.contains_profanity(x)
  if bad_words == True:
    profanity_col.append(int(1))
  elif bad_words == False:
    profanity_col.append(int(0))

df = df.assign(profanity=pd.Series(profanity_col).values)
print(df[['profanity']].value_counts())

CodePudding user response:

This should run faster, and it eliminates a lot of unnecessary code and the needless de-structuring of a perfectly good DataFrame.

df['profanity'] = df.Text.apply(profanity.contains_profanity).astype(int)
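
To check whether a change like this actually helps before committing to the full run, one option is to time the check on a small sample and scale the result up. The sketch below is only an illustration: `estimate_total_seconds` and `fake_check` are hypothetical names, and `fake_check` is a stand-in for `profanity.contains_profanity` so the snippet runs without the library. It also assumes the cost per row stays roughly constant.

```python
import time

def estimate_total_seconds(func, sample, total_rows):
    """Time func over the sample and scale linearly to total_rows."""
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * total_rows

# Hypothetical stand-in for profanity.contains_profanity.
def fake_check(text):
    return "darn" in text.lower()

# 1,000 sample rows, extrapolated to one million rows.
sample = ["some ordinary text", "well, darn it"] * 500
estimate = estimate_total_seconds(fake_check, sample, 1_000_000)
print(f"Estimated total: {estimate:.1f} seconds")
```

By the same arithmetic, the 6 minutes per 1,000 rows from the question works out to roughly 100 hours for a million rows.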

Although this is probably not an option on Google Colab, since I believe you're limited to a single core there, it could be sped up significantly using pandarallel:

from pandarallel import pandarallel
pandarallel.initialize()

df['profanity'] = df.Text.parallel_apply(profanity.contains_profanity).astype(int)
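
If exact whole-word matching is good enough, a different technique from the one above is to compile the word list into a single regular expression once and search each text with it. This is a minimal sketch and not a drop-in replacement for better-profanity: `BAD_WORDS` is a tiny hypothetical list, `contains_bad_word` is a made-up helper name, and as far as I know this will not catch the modified spellings that better-profanity is designed to handle.

```python
import re

# Hypothetical word list for illustration; a real list would be much longer.
BAD_WORDS = ["darn", "heck"]

# One alternation pattern with word boundaries, compiled a single time.
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BAD_WORDS)) + r")\b",
    re.IGNORECASE,
)

def contains_bad_word(text):
    """Return 1 if any listed word appears as a whole word in text, else 0."""
    return int(PATTERN.search(str(text)) is not None)

texts = ["Well, darn it", "Nothing to see here", "HECK no", "heckle"]
flags = [contains_bad_word(t) for t in texts]
print(flags)  # [1, 0, 1, 0]
```

With the DataFrame from the question, the same helper could be applied as `df['profanity'] = df.Text.map(contains_bad_word)`.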