textreuse::TextReuseCorpus generates error when passing in a variable

Time:07-16

I'm trying to use the textreuse library to implement the minhash algorithm. When I execute this code, I get a valid result.

library(janeaustenr)
library(dplyr)
library(textreuse)
library(tibble)
library(tokenizers)
library(assertthat)
custom_tokenize_ngrams <- function(string, lowercase = TRUE, n = 3) {
  assertthat::assert_that(length(string) == 1)
  tokenizers::tokenize_ngrams(x = string, lowercase = lowercase, n = n) |> unlist()
}

  janeaustenr::austen_books() |>
  dplyr::mutate(i_row = dplyr::row_number()) |> 
  dplyr::select(i_row, text) |>
  dplyr::mutate_all(as.character) |>
  dplyr::filter(nchar(text)>50 & nchar(text)<1000) |>
  utils::head(500) |>
  tibble::deframe() |>
  {\(.) 
    textreuse::TextReuseCorpus(
      text = ., 
      tokenizer = custom_tokenize_ngrams, 
      minhash_func = textreuse::minhash_generator(n = 60, seed = 123456) , 
      keep_tokens = TRUE,
      progress = TRUE, 
      n = 2L
    ) 
    }()

But if I try to pass n to TextReuseCorpus as a variable, I get an error. This virtually identical code fails:

library(janeaustenr)
library(dplyr)
library(textreuse)
library(tibble)
library(tokenizers)
library(assertthat)
custom_tokenize_ngrams <- function(string, lowercase = TRUE, n = 3) {
  assertthat::assert_that(length(string) == 1)
  tokenizers::tokenize_ngrams(x = string, lowercase = lowercase, n = n) |> unlist()
}


nx = 2L

  janeaustenr::austen_books() |>
  dplyr::mutate(i_row = dplyr::row_number()) |> 
  dplyr::select(i_row, text) |>
  dplyr::mutate_all(as.character) |>
  dplyr::filter(nchar(text)>50 & nchar(text)<1000) |>
  utils::head(500) |>
  tibble::deframe() |>
  {\(.) 
    textreuse::TextReuseCorpus(
      text = ., 
      tokenizer = custom_tokenize_ngrams, 
      minhash_func = textreuse::minhash_generator(n = 60, seed = 123456) , 
      keep_tokens = TRUE,
      progress = TRUE, 
      n = nx
    ) 
    }()

ERROR

Error in n_call + 1 : non-numeric argument to binary operator

I am really not sure why this is happening, or how I can fix it.

Thanks

Answer:

It seems this error was reported back in 2017, but it doesn't look like it was ever fixed, because the problem code is still there. The textreuse package doesn't appear to have been updated in two years, so I'm not sure it will be fixed properly.
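
The error message hints at the cause: the package appears to read n from the unevaluated call, so a literal such as 2L arrives as a number while a variable arrives as the symbol nx. Here is a minimal base-R sketch of that pattern. Note that toy_corpus is a hypothetical stand-in written for illustration, not code copied from textreuse.

```r
# Hypothetical stand-in (not textreuse code): reads `n` from the
# unevaluated call instead of from an evaluated argument.
toy_corpus <- function(text, ...) {
  n_call <- match.call(expand.dots = TRUE)$n
  if (is.null(n_call)) n_call <- 3
  n_call + 1  # fails when n_call is a symbol rather than a number
}

nx <- 2L

# A literal is stored in the call as a number, so arithmetic works.
res_literal <- toy_corpus("some text", n = 2L)

# A variable is stored in the call as the symbol `nx`, so `+` fails.
res_variable <- try(toy_corpus("some text", n = nx), silent = TRUE)
failed <- inherits(res_variable, "try-error")
```

With a literal the call succeeds; with the variable, `failed` is TRUE because arithmetic on a symbol is not defined.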

You can write a wrapper around the function to force evaluation of that parameter.

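# Evaluate `n` in the caller's frame before handing the call on, so the
# package receives a value rather than the unevaluated symbol.
myTextReuseCorpus <- function(..., n = 3) {
  mc <- match.call()
  if (!is.null(mc$n)) {
    mc$n <- eval.parent(mc$n)  # e.g. replaces the symbol `nx` with 2L
  }
  mc[[1]] <- quote(textreuse::TextReuseCorpus)
  eval.parent(mc)
}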

and then use it in place of TextReuseCorpus:

janeaustenr::austen_books() |>
  dplyr::mutate(i_row = dplyr::row_number()) |> 
  dplyr::select(i_row, text) |>
  dplyr::mutate_all(as.character) |>
  dplyr::filter(nchar(text)>50 & nchar(text)<1000) |>
  utils::head(500) |>
  tibble::deframe() |>
  {\(.) 
    myTextReuseCorpus(
      text = ., 
      tokenizer = custom_tokenize_ngrams, 
      minhash_func = textreuse::minhash_generator(n = 60, seed = 123456) , 
      keep_tokens = TRUE,
      progress = TRUE,
      n = nx
    ) 
  }()
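
If you want to check the wrapper pattern without loading the package, here is a rough sketch against a made-up function. Both toy_corpus and my_toy_corpus are hypothetical names for illustration only; the real TextReuseCorpus is not called here.

```r
# Hypothetical stand-in that reads `n` from the unevaluated call.
toy_corpus <- function(text, ...) {
  n_call <- match.call(expand.dots = TRUE)$n
  if (is.null(n_call)) n_call <- 3
  n_call + 1
}

# Same wrapper idea: swap the symbol for its value, then re-dispatch.
my_toy_corpus <- function(..., n = 3) {
  mc <- match.call()
  if (!is.null(mc$n)) {
    mc$n <- eval.parent(mc$n)  # evaluate `n` where the wrapper was called
  }
  mc[[1]] <- quote(toy_corpus)
  eval.parent(mc)
}

nx <- 2L
out_var <- my_toy_corpus("some text", n = nx)  # n passed as a variable
out_default <- my_toy_corpus("some text")      # n omitted entirely
```

When n is omitted, the wrapper leaves the call untouched, so the wrapped function falls back to its own default handling.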