I am using googleLanguageR to automatically detect the language of a text column in a data frame. For a particular sentence, I do the following:
library(googleLanguageR)
gl_auth("credential.json")
gl_translate_detect(df[[45, 'text']])
where text is a column in the data frame df, 45 is the row number for which I want to detect the language, and "credential.json" is a private API key file from Google.
This gives me the detected language for that sentence as output. However, I want to apply the detection to the entire text column, which contains a mix of English and German texts, and to separate them by language.
I tried the following:
gl_translate_detect(df[['text']])
But this gives me:
Error in nchar(string) : invalid multibyte string, element 13
My idea is to feed the whole text column of the data frame as a corpus and detect the underlying language of each entry.
CodePudding user response:
The function may not be vectorized. We can use rowwise to apply it one row at a time, with tryCatch so a problematic element returns NA instead of aborting the whole run:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(out = tryCatch(gl_translate_detect(text),
                        error = function(e) NA_character_))
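If the goal is to keep just the detected language code per row (so the English and German rows can then be separated), the language column of the result can be extracted inside the same pipeline. A minimal sketch, assuming gl_translate_detect() returns a tibble with a language column, as documented in googleLanguageR:

library(dplyr)
library(googleLanguageR)

# detect one row at a time and keep only the language code
df_lang <- df %>%
  rowwise() %>%
  mutate(lang = tryCatch(gl_translate_detect(text)$language[1],
                         error = function(e) NA_character_)) %>%
  ungroup()

# split the rows by detected language, e.g. English vs German
df_en <- filter(df_lang, lang == "en")
df_de <- filter(df_lang, lang == "de")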
Or with lapply to loop over each element of the 'text' column and apply the function:
lapply(df$text, gl_translate_detect)
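Since each call returns a small tibble, the list of per-element results can be combined afterwards, for example with dplyr::bind_rows(). A sketch (elements that error are simply dropped here):

# wrap each call so a bad element returns NULL instead of stopping the loop
out <- lapply(df$text, function(x) {
  tryCatch(gl_translate_detect(x), error = function(e) NULL)
})
detected <- dplyr::bind_rows(out)

As for the original error (invalid multibyte string, element 13), it usually means one of the strings is not valid in the session's encoding rather than a problem with the API itself, so re-encoding the column first may avoid it. A sketch, assuming the text is meant to be UTF-8 (invalid bytes are dropped):

# convert from the native encoding to UTF-8, removing bytes that cannot be converted
df$text <- iconv(df$text, from = "", to = "UTF-8", sub = "")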