I have the following df:
import pandas as pd

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment': ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})
It has the following display:
   user   comment
0  Id159  I inboxed you
1  Id758  123
2  Id146  123
3  Id477  je suis fatigué
4  Id212  j'aime
5  Id999  ما نوع الجهاز بالضبط
My goal is to add a new column containing the language used in df['comment'], as follows:
   user   comment               language
0  Id159  I inboxed you         en
1  Id758  123                   UNKNOWN
2  Id146  123                   UNKNOWN
3  Id477  je suis fatigué       fr
4  Id212  j'aime                fr
5  Id999  ما نوع الجهاز بالضبط  ar
My code:
from langdetect import detect
df['language'] = [detect(x) for x in df['comment']]
When I tried to use detect, I got the following error message:
LangDetectException: No features in text.
I tried adding an if/else statement but couldn't solve the issue.
Any help will be highly appreciated (I upvote all answers).
Thank you!
CodePudding user response:
Have you tried adding an if statement that checks the length of the input text before calling detect()? If the text is shorter than 3 characters, you can assign "UNKNOWN" to the language column for that row.
I would comment on this, but I am not allowed to comment yet.
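The check suggested above might look like the sketch below. The helper name looks_detectable and the min_length parameter are made up for illustration. The extra test for at least one letter is an added assumption, not part of the suggestion: a pure length check would still let '123' (exactly 3 characters) through to detect().

```python
def looks_detectable(value, min_length=3):
    """Return True if value seems worth passing to a language detector."""
    # Non-string values (such as the integer 123) are rejected outright.
    if not isinstance(value, str):
        return False
    # Length check from the suggestion: very short inputs are skipped.
    if len(value.strip()) < min_length:
        return False
    # Added assumption: require at least one alphabetic character.
    return any(ch.isalpha() for ch in value)


comments = ["I inboxed you", "123", 123, "je suis fatigué", "j'aime", "ما نوع الجهاز بالضبط"]
flags = [looks_detectable(c) for c in comments]
print(flags)
```

With a helper like this, the column could be built as [detect(x) if looks_detectable(x) else "UNKNOWN" for x in df["comment"]].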
CodePudding user response:
langdetect throws an error if it doesn't see any letters in its input.
Use a custom function to catch that specific error:
import pandas as pd
from langdetect import detect, lang_detect_exception

df = pd.DataFrame({
    'user': ['Id159', 'Id758', 'Id146', 'Id477', 'Id212', 'Id999'],
    'comment': ["I inboxed you", '123', 123, 'je suis fatigué', "j'aime", 'ما نوع الجهاز بالضبط']
})

def check_lang(val):
    # str() lets non-string values such as the integer 123 reach detect()
    try:
        lang = detect(str(val))
    except lang_detect_exception.LangDetectException:
        # raised when the text has no usable features, e.g. digits only
        lang = 'UNKNOWN'
    return lang

df['language'] = df['comment'].apply(check_lang)
print(df)
The output:
   user   comment               language
0  Id159  I inboxed you         en
1  Id758  123                   UNKNOWN
2  Id146  123                   UNKNOWN
3  Id477  je suis fatigué       fr
4  Id212  j'aime                fi
5  Id999  ما نوع الجهاز بالضبط  ar
CodePudding user response:
It would be better if you clarified all the cases you want to set as UNKNOWN.
Anyway, I assume you want to set non-string and numeric values to UNKNOWN.
Then:
df["language"] = [
    detect(x) if isinstance(x, str) and not x.isnumeric() else "UNKNOWN"
    for x in df["comment"]
]
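One caveat with this guard: str.isnumeric() is only True when every character is numeric, so strings such as "12.5", "!!!" or an empty string pass the check and still reach detect(), which may then raise the same exception. A small sketch of what the condition lets through (the helper name passes_guard is hypothetical):

```python
def passes_guard(x):
    """Mirror the condition used in the list comprehension above."""
    return isinstance(x, str) and not x.isnumeric()


samples = ["I inboxed you", "123", 123, "12.5", "!!!", ""]
for s in samples:
    # True means the value would be handed to detect()
    print(repr(s), passes_guard(s))
```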
EDIT:
Or, for a more general approach (though not really recommended), you can just use exception handling:
def f(x):
    try:
        return detect(x)
    except:
        return "UNKNOWN"

df["language"] = [f(x) for x in df["comment"]]
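A bare except also swallows KeyboardInterrupt and SystemExit, which is one reason this approach is not recommended. A hedged variant that catches only Exception is sketched below. The names safe_detect and detect_fn are illustrative, and fake_detect is a stand-in so the snippet runs on its own; in real use you would pass langdetect's detect instead.

```python
def safe_detect(value, detect_fn, fallback="UNKNOWN"):
    """Call detect_fn(value) and return fallback if it raises."""
    try:
        return detect_fn(value)
    except Exception:  # unlike a bare except, lets KeyboardInterrupt through
        return fallback


def fake_detect(text):
    """Stand-in detector for demonstration only; not langdetect."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "xx"


# The integer 123 makes fake_detect raise TypeError, which is also caught.
results = [safe_detect(x, fake_detect) for x in ["hello", "123", 123]]
print(results)
```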