I am using googleLanguageR to automatically detect the language of a text column in a data frame. For a particular sentence, I do the following:
library(googleLanguageR)
gl_auth("credential.json")
gl_translate_detect(df[[45, 'text']])
where text is a column in the data frame df, 45 is the row number for which I want to detect the language, and "credential.json" is a private API key file from Google.
This gives me the detected language for that sentence as output. However, I want to apply the detection to the entire text column, which contains a mix of English and German texts, and to separate them by language.
I tried the following:
gl_translate_detect(df[['text']])
But this gives me:
Error in nchar(string) : invalid multibyte string, element 13
My idea is to feed the whole text column of the data frame as a corpus and detect the underlying language of each entry.
CodePudding user response:
The function may not be vectorized. We can use rowwise to apply it one row at a time, with tryCatch so a problematic element returns NA instead of aborting the whole run:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(out = tryCatch(gl_translate_detect(text),
                        error = function(e) NA_character_))
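If the goal is to keep just the detected language code per row (so the English and German rows can then be separated), the language column of the result can be extracted inside the same pipeline. A minimal sketch, assuming gl_translate_detect() returns a tibble with a language column, as documented in googleLanguageR:

library(dplyr)
library(googleLanguageR)

# detect one row at a time and keep only the language code
df_lang <- df %>%
  rowwise() %>%
  mutate(lang = tryCatch(gl_translate_detect(text)$language[1],
                         error = function(e) NA_character_)) %>%
  ungroup()

# split the rows by detected language, e.g. English vs German
df_en <- filter(df_lang, lang == "en")
df_de <- filter(df_lang, lang == "de")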
Or with lapply to loop over each element of the 'text' column and apply the function:
lapply(df$text, gl_translate_detect)
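Since each call returns a small tibble, the list of per-element results can be combined afterwards, for example with dplyr::bind_rows(). A sketch (elements that error are simply dropped here):

# wrap each call so a bad element returns NULL instead of stopping the loop
out <- lapply(df$text, function(x) {
  tryCatch(gl_translate_detect(x), error = function(e) NULL)
})
detected <- dplyr::bind_rows(out)

As for the original error (invalid multibyte string, element 13), it usually means one of the strings is not valid in the session's encoding rather than a problem with the API itself, so re-encoding the column first may avoid it. A sketch, assuming the text is meant to be UTF-8 (invalid bytes are dropped):

# convert from the native encoding to UTF-8, removing bytes that cannot be converted
df$text <- iconv(df$text, from = "", to = "UTF-8", sub = "")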