How to detect the language used in a column and put it in a new column?

I have the following df:

import pandas as pd

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment': ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})

It displays as follows:

    user    comment
0   Id159   I inboxed you
1   Id758   123
2   Id146   123
3   Id477   je suis fatigué
4   Id212   j'aime
5   Id999   ما نوع الجهاز بالضبط

My goal is to get a new column containing the language used in df['comment'], as follows:

    user    comment         language
0   Id159   I inboxed you   en
1   Id758   123             UNKNOWN
2   Id146   123             UNKNOWN
3   Id477   je suis fatigué fr
4   Id212   j'aime          fr
5   Id999   ما نوع الجهاز بالضبط  ar

My code:

from langdetect import detect

df['language'] = [detect(x) for x in df['comment']]

When I tried to use detect, I got the following error message:

LangDetectException: No features in text.

I tried to add an if/else statement but couldn't solve the issue.

Any help from your side will be highly appreciated (I upvote all answers)

Thank you!

CodePudding user response：

Have you tried adding an if statement that checks the length of the input text before calling detect()? If the text is shorter than 3 characters, you can assign "UNKNOWN" to the language column for that row.

I would comment on this, but I am not allowed to comment yet.
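
A length check on its own would not catch '123' (exactly three characters) or the integer 123, so here is a minimal sketch of a related pre-check that looks for at least one letter instead. The names has_letters, label_language and fake_detect are hypothetical helpers, not part of langdetect, and the detector is passed in as a parameter so the sketch runs without the library. It may not cover every input that langdetect rejects, so treat it as a first filter rather than a guarantee.

```python
def has_letters(value):
    """True only for strings that contain at least one alphabetic character."""
    return isinstance(value, str) and any(ch.isalpha() for ch in value)


def label_language(value, detect_fn, fallback="UNKNOWN"):
    """Call detect_fn only when the value passes the letter check."""
    return detect_fn(value) if has_letters(value) else fallback


# Stand-in detector so the sketch runs without langdetect installed;
# in real use, pass langdetect's detect function instead.
def fake_detect(text):
    return "xx"


comments = ["I inboxed you", "123", 123, "je suis fatigué", "ما نوع الجهاز بالضبط"]
labels = [label_language(c, fake_detect) for c in comments]
print(labels)
```

With langdetect installed, the column could then be filled with df['language'] = [label_language(c, detect) for c in df['comment']].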

CodePudding user response：

langdetect raises LangDetectException when it cannot extract any features from its input, for example when the text contains no letters.
Use a custom function to catch that specific error:

import pandas as pd
from langdetect import detect, lang_detect_exception

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment' : ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})

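def check_lang(val):
    # str() lets non-string values such as the integer 123 reach detect();
    # inputs with no detectable features raise LangDetectException.
    try:
        lang = detect(str(val))
    except lang_detect_exception.LangDetectException:
        lang = 'UNKNOWN'
    return lang

df['language'] = df['comment'].apply(check_lang)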
print(df)

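The output (note that the very short "j'aime" comes back as fi rather than fr; langdetect is unreliable on short texts, and its results can vary between runs unless DetectorFactory.seed is set before calling detect):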

    user               comment language
0  Id159         I inboxed you       en
1  Id758                   123  UNKNOWN
2  Id146                   123  UNKNOWN
3  Id477       je suis fatigué       fr
4  Id212                j'aime       fi
5  Id999  ما نوع الجهاز بالضبط       ar

CodePudding user response：

It would be better if you clarified all the cases you want to label as UNKNOWN.

Anyway, I assume you want to label non-string and numeric values as UNKNOWN.

Then,

df["language"] = [
    detect(x) if isinstance(x, str) and not x.isnumeric() else "UNKNOWN"
    for x in df["comment"]
]
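
One caveat with this condition: isnumeric() only screens out strings made up entirely of digit-like characters, so values such as "12.5", "!!!" or an empty string pass the filter and would likely still make detect() raise. A quick sketch of what gets through, where passes_filter is a hypothetical helper that just repeats the condition:

```python
def passes_filter(x):
    """Mirror of the condition used in the list comprehension above."""
    return isinstance(x, str) and not x.isnumeric()


samples = ["I inboxed you", "123", 123, "12.5", "-7", "!!!", ""]

# Values that the filter would hand on to detect()
reaches_detect = [s for s in samples if passes_filter(s)]
print(reaches_detect)
```

Only '123' and the integer 123 are held back here; the remaining letter-free strings are passed on, which is why combining the filter with exception handling is the safer choice for messy data.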

EDIT:

Or, for a more general approach (though not really recommended), you can just use exception handling:

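def f(x):
    try:
        return detect(x)
    except Exception:
        # Covers LangDetectException as well as errors raised for
        # non-string input (such as the integer 123), without also
        # swallowing KeyboardInterrupt the way a bare except would.
        return "UNKNOWN"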

df["language"] = [f(x) for x in df["comment"]]