I'm using the code below in Google Colab to add a separate column to a DataFrame that flags whether the 'Text' column contains a curse word. The DataFrame has more than a million rows, and at roughly 6 minutes per 1,000 sampled rows this code would take around 5 days to finish. Is there a more efficient alternative? Maybe using deep learning?
import pandas as pd
import language_tool_python
from better_profanity import profanity

tool = language_tool_python.LanguageTool('en-US')
profanity.load_censor_words()

profanity_col = []
for x in df.Text.values:
    matches = tool.check(x)                      # grammar check (result not used below)
    bad_words = profanity.contains_profanity(x)  # True if the row contains a curse word
    if bad_words == True:
        profanity_col.append(int(1))
    elif bad_words == False:
        profanity_col.append(int(0))

df = df.assign(profanity=pd.Series(profanity_col).values)
print(df[['profanity']].value_counts())
CodePudding user response:
This should run faster and eliminate a lot of unnecessary code and needless de-structuring of a perfectly good DataFrame.
df['profanity'] = df.Text.apply(profanity.contains_profanity).astype(int)
Although not an option on Google Colab, since I believe you're limited to a single core there, this could be sped up significantly using pandarallel:
from pandarallel import pandarallel
pandarallel.initialize()
df['profanity'] = df.Text.parallel_apply(profanity.contains_profanity).astype(int)
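If apply is still too slow for a million rows, another option is to collapse the word list into one regex and let pandas' vectorized string methods do the matching instead of calling a Python function per row. This is only a sketch: the bad_words list here is a placeholder (better_profanity doesn't expose its list under that name), and a plain regex won't catch the obfuscated spellings that better_profanity handles.

import re
import pandas as pd

# Placeholder word list for illustration; substitute your real list of curse words.
bad_words = ["badword1", "badword2", "badword3"]

# One alternation with word boundaries, escaped so special characters are treated literally.
pattern = r"\b(?:" + "|".join(map(re.escape, bad_words)) + r")\b"

# Vectorized, case-insensitive match; na=False treats missing text as "no profanity".
df['profanity'] = df['Text'].str.contains(pattern, case=False, regex=True, na=False).astype(int)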