How to check which row is producing the LangDetectException error in LangDetect?

I have a dataset of tweets that are mainly in English but also include several tweets in Indian languages (such as Punjabi, Hindi, Tamil, etc.). I want to keep only the English tweets and remove the rows written in other languages. I tried this [https://stackoverflow.com/questions/67786493/pandas-dataframe-filter-out-rows-with-non-english-text] and it worked on the sample dataset. However, when I tried it on my own dataset, it raised an error:

LangDetectException: No features in text.

I have also checked another question [https://stackoverflow.com/questions/69804094/drop-non-english-rows-pandasand], where the accepted answer discusses this error and mentions that empty rows might cause it, so I have already cleaned my dataset to remove all empty rows.

Simple code that worked on the sample data but not on the original data:

from langdetect import detect
import pandas as pd

df = pd.read_csv('Sample.csv')
df_new = df[df.text.apply(detect).eq('en')]
print('New df is: ', df_new)

How can I check which row is producing the error?

Thanks in advance!

CodePudding user response:

Use a custom function that returns True if detect fails, then select the rows where it does:

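from langdetect import detect
import pandas as pd

df = pd.read_csv('Sample.csv')

def f(x):
    # True when detect() raises for this value, False when it succeeds
    try:
        detect(x)
        return False
    except Exception:
        return True

# Rows (with their original index) whose text makes detect() fail
s = df.loc[df.text.apply(f), 'text']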
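
To also see why each row fails, the same try/except idea can record the exception message next to the index. The following is only a minimal sketch: find_failing_rows and fake_detect are hypothetical names, not part of langdetect or pandas, and the detector is passed in as a parameter so the example runs without langdetect installed.

```python
def find_failing_rows(texts, detect_fn):
    """Return (index, text, error) for every entry on which detect_fn raises.

    texts: iterable of (index, text) pairs
    detect_fn: the detector to call on each text
    """
    failures = []
    for idx, text in texts:
        try:
            detect_fn(text)
        except Exception as exc:
            # Keep the exception type and message so the cause is visible
            failures.append((idx, text, f"{type(exc).__name__}: {exc}"))
    return failures


# Stand-in detector (hypothetical, for illustration only): it fails on text
# without any letters, loosely imitating the "No features in text." error.
def fake_detect(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en"


rows = [(0, "hello world"), (1, "12345 !!!"), (2, "   ")]
failures = find_failing_rows(rows, fake_detect)
for idx, text, error in failures:
    print(idx, repr(text), error)
```

With the real data, the call would presumably be find_failing_rows(df.text.items(), detect), since Series.items() yields (index, value) pairs.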

Another idea is to create a new column filled with the output of detect, returning NaN where it fails. Then filter the rows with missing values into df1 (the rows that produce the error) and the rows detected as 'en' into df_new:

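from langdetect import detect
import numpy as np
import pandas as pd

df = pd.read_csv('Sample.csv')

def f1(x):
    # Detected language code, or NaN when detect() raises
    try:
        return detect(x)
    except Exception:
        return np.nan

df['new'] = df.text.apply(f1)

# Rows where detection failed
df1 = df[df.new.isna()]

# Rows detected as English
df_new = df[df.new.eq('en')]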
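
Once the failing rows are isolated, it can help to see what they contain. A likely cause, though not confirmed for this dataset, is that a row is not empty but still holds nothing a detector can use, such as only digits, punctuation or emoji. The sketch below is a rough heuristic under that assumption; describe_row is a hypothetical helper, not a langdetect function.

```python
def describe_row(value):
    """Roughly classify why a cell might give a language detector nothing to work with."""
    if not isinstance(value, str):
        return "not a string"      # e.g. NaN (a float) read from an empty CSV cell
    if not value.strip():
        return "whitespace only"
    if not any(ch.isalpha() for ch in value):
        return "no letters"        # only digits, punctuation, symbols or emoji
    return "has letters"


samples = ["Good morning!", "   ", "2021 :) !!!", float("nan")]
labels = [describe_row(s) for s in samples]
print(labels)
```

Applied to the failing rows, this might look like df1.text.apply(describe_row).value_counts(). A row labelled "has letters" can still fail, for example if the detector discards URLs before extracting features, so treat the labels as a starting point rather than a diagnosis.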