I am trying to do what has been asked in this question. The problem I am having is that .apply() does not properly iterate over the rows. I have a dataframe which looks like this:
stuff, body
12, "Je parle francais"
25, "This is english"
I have tried three things. First, I ran
df['body'].apply(lambda row: (detect == "en"))
which returned False for every row, regardless of language (apparently because each row ends up holding <function detect at random_bytes> rather than a language code). I then tried
df['body'].apply(detect)
and
df['body'].apply(lambda row: detect(row))
both of which raised:
LangDetectException: No features in text.
I cannot really afford to loop over every single row with a for loop because of the amount of data I have. So how can I find out which rows in the body column are English and which are not, using the langdetect library?
CodePudding user response:
Try this:
import pandas as pd
from langdetect import detect, LangDetectException
df = pd.read_clipboard(sep=', ') #Create dataframe from clipboard
df.loc[3, :] = [30,''] #Add blank text to dataframe
def f(x):
    try:
        result = detect(x)
    except LangDetectException as e:
        result = str(e)
    return result
df["lang"] = df["body"].apply(f)
Output:
   stuff                 body                  lang
0   12.0  "Je parle francais"                    fr
1   25.0    "This is english"                    en
3   30.0                       No features in text.
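If you want a plain True/False answer for "is this row English", storing the exception message in the lang column gets in the way. Below is a minimal sketch of a variant that returns None for rows langdetect cannot handle and then builds a boolean column. The names safe_detect, demo and is_english are made up for this example, and the detector is passed in as an argument only so the helper can be tried without langdetect installed. Setting DetectorFactory.seed is the way the langdetect README suggests to get repeatable results, since detection on short strings can otherwise vary between runs.

```python
def safe_detect(text, detector, errors=(Exception,)):
    """Return detector(text), or None if the text is missing, blank, or undetectable.

    `detector` and `errors` are parameters of this sketch, not part of langdetect.
    """
    # NaN, None and whitespace-only strings have nothing to detect.
    if not isinstance(text, str) or not text.strip():
        return None
    try:
        return detector(text)
    except errors:
        # e.g. LangDetectException: No features in text.
        return None


def demo():
    """Hypothetical usage with pandas and langdetect (both must be installed)."""
    import pandas as pd
    from langdetect import DetectorFactory, LangDetectException, detect

    DetectorFactory.seed = 0  # make langdetect deterministic across runs

    df = pd.DataFrame(
        {"stuff": [12, 25, 30],
         "body": ["Je parle francais", "This is english", ""]}
    )
    # Call detect on each value; failures become None instead of raising.
    df["lang"] = df["body"].apply(
        lambda text: safe_detect(text, detect, (LangDetectException,))
    )
    # None never equals "en", so undetectable rows come out as False.
    df["is_english"] = df["lang"].eq("en")
    return df
```

Calling demo() returns the frame with the two extra columns. Note that this still calls detect once per row through .apply(), so it is not vectorised; it only avoids writing the loop by hand and keeps bad rows from aborting the whole run.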