Applying language detector to every row in pandas

Time:07-12

I am trying to do what has been asked in this question. The problem I am having is that .apply() does not properly iterate over the rows. I have a dataframe which looks like this:

stuff, body
 12, "Je parle francais"
 25,  "This is english"

I have tried three things. Running df['body'].apply(lambda row: (detect == "en")) returned False for every row, regardless of language (detect is never called, so each row just sees <function detect at random_bytes>). Both df['body'].apply(detect) and df['body'].apply(lambda row: detect(row)) ended up raising:

LangDetectException: No features in text.

I cannot really afford to go through every single row with a for loop because of the amount of data I have. So how would I find out which rows in the body column are English and which are not, using the langdetect library?

CodePudding user response:
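
The first attempt returns False everywhere for a reason that has nothing to do with langdetect: detect == "en" compares the function object itself with a string and never calls it. A minimal sketch of the difference, using a hypothetical stand-in named fake_detect (not the real detector) so that it runs without langdetect installed:

```python
def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect; not the real detector."""
    return "fr" if "parle" in text else "en"


rows = ["Je parle francais", "This is english"]

# Bug: the lambda ignores `row` and compares the function object to a string.
wrong = [(lambda row: fake_detect == "en")(row) for row in rows]

# Fix: call the function on the row, then compare its return value.
right = [(lambda row: fake_detect(row) == "en")(row) for row in rows]

print(wrong)  # [False, False]
print(right)  # [False, True]
```

The other two attempts do call detect correctly; they fail only because at least one row contains text that langdetect cannot extract features from.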

Try this:

import pandas as pd
from langdetect import detect, LangDetectException

df = pd.read_clipboard(sep=', ')  # Create dataframe from the sample data on the clipboard
df.loc[3, :] = [30, '']  # Add a row with blank text to reproduce the LangDetectException

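def f(x):
    # detect() raises LangDetectException when the text has no usable features (e.g. '')
    try:
        result = detect(x)
    except LangDetectException as e:
        # Keep the error message for that row instead of aborting the whole apply()
        result = str(e)
    return result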


df["lang"] = df["body"].apply(f)

Output:

   stuff                 body                  lang
0   12.0  "Je parle francais"                    fr
1   25.0    "This is english"                    en
3   30.0                       No features in text.
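
To get at the original question (which rows are English and which are not), the same try/except idea can return a boolean instead of a language code. The following is a minimal sketch under stated assumptions, not the code from the answer above: is_english and fake_detect are hypothetical names, and the stand-in detector exists only so the snippet runs without langdetect installed.

```python
def is_english(text, detect_fn):
    """Return True if detect_fn labels text as "en", False otherwise.

    Blank or non-string values, and any text the detector rejects,
    count as "not English" instead of raising.
    """
    if not isinstance(text, str) or not text.strip():
        return False
    try:
        return detect_fn(text) == "en"
    except Exception:  # assumption: narrow this to LangDetectException with langdetect
        return False


def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect; not the real detector."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "fr" if "parle" in text else "en"


bodies = ["Je parle francais", "This is english", "", "12345"]
mask = [is_english(body, fake_detect) for body in bodies]
print(mask)  # [False, True, False, False]
```

With pandas and langdetect available, the equivalent call would presumably be df["is_english"] = df["body"].apply(is_english, detect_fn=detect), after which df[df["is_english"]] selects the English rows. Note that .apply() still calls the detector once per row in Python, so this handles the exception but does not make detection itself faster.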