How to use a custom tokenizer in quanteda pipeline


I would like to use a custom tokenizer from the tokenizers package within the quanteda pipeline df %>% corpus() %>% tokens() %>% dfm().

But I can't get it going...

An example:

df <- data.frame(id = c(1:3), text = c("my first text string", "and another one", "huch, just another"))

  id                 text
1  1 my first text string
2  2      and another one
3  3   huch, just another

Of course I looked up how to combine quanteda's tokens() with the tokenizers package:

tokens(tokenizers::tokenize_sentences(df$text))
Tokens consisting of 3 documents.
text1 :
[1] "my first text string"

text2 :
[1] "and another one"

text3 :
[1] "huch, just another"

But I would like to use this with a corpus() object so that the IDs are kept as document-level data.
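(For reference, corpus() can take the id column as the document identifiers via its docid_field argument; a minimal sketch, assuming the df above and assuming the integer id column is accepted as-is, otherwise convert it with as.character() first:)

library(quanteda)

corp <- corpus(df, docid_field = "id", text_field = "text")
docnames(corp)  # document names now come from the id column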

The obvious thing does not work:

df %>% corpus() %>% tokens(tokenizers::tokenize_sentences(.)) %>% dfm()

 Error in match.arg(what, c("word", "word1", "sentence", "character", "fasterword",  : 
      'arg' must be NULL or a character vector

I guess it's because piping into the tokenizers call doesn't yield a usable result either:

df$text %>% tokens(tokenizers::tokenize_sentences(.))

Error in match.arg(what, c("word", "word1", "sentence", "character", "fasterword",  : 
  'arg' must be NULL or a character vector

What's a workaround?

Thanks a lot for your help!

CodePudding user response:

The problem is that in tokens(x, tokenizers::tokenize_sentences(.)) the already-tokenized list lands in the second argument, what, which only accepts one of the built-in tokenizer names (hence the match.arg error). To plug in an external tokenizer, pass its output, a named list of character vectors, as the first argument to tokens(). From the Details section of ?tokens:

As of version 2, the choice of tokenizer is left more to the user, and tokens() is treated more as a constructor (from a named list) than a tokenizer. This allows users to use any other tokenizer that returns a named list, and to use this as an input to tokens(), with removal and splitting rules applied after this has been constructed (passed as arguments). These removal and splitting rules are conservative and will not remove or split anything, however, unless the user requests it.
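In other words, tokens() accepts a named list of character vectors directly; a minimal illustration, reusing the example strings with made-up document names:

tokens(list(doc1 = "my first text string",
            doc2 = c("and another one", "huch, just another")))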

Given this, you can iterate over the corpus (which behaves as a named character vector of document texts) and pipe the resulting named list into tokens().

library(quanteda)
library(tokenizers)

df %>%
  corpus() %>% 
  lapply(tokenize_sentences) %>%  # one sentence list per document, named by docname
  lapply(unlist) %>%              # flatten to a character vector of sentences per document
  tokens()
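From there you can pipe straight on to dfm() to complete the original pipeline; a sketch, assuming the df from the question. Note that lapply() returns a plain list, so the corpus docvars are dropped and only the document names carry through to the tokens and dfm objects.

df %>%
  corpus() %>% 
  lapply(tokenize_sentences) %>% 
  lapply(unlist) %>% 
  tokens() %>% 
  dfm()

Each whole sentence then appears as a feature in the resulting dfm.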