Summarise before collecting in arrow using strings for column names

Time:09-30

Say I want to summarise a column in an arrow table prior to collecting (because the actual dataset is larger than memory). I could do something like this:

arrow_table(mtcars) %>% 
  summarise(mean(mpg)) %>% 
  collect()

# A tibble: 1 × 1
#     `mean(mpg)`
#           <dbl>
#   1        20.1

Now, say I want to do this programmatically and the column name is provided as a string. In regular (i.e., non-arrow) dplyr, I could use across and all_of like this:

foo_regular <- function(x){
  mtcars %>% 
    summarise(across(all_of(x), mean)) %>% 
    collect()
}

foo_regular("mpg")

#        mpg
# 1 20.09062

But how do I do this in arrow?

foo_arrow <- function(x){
  arrow_table(mtcars) %>%
    summarise(across(all_of(x), mean)) %>%
    collect()
}

foo_arrow("mpg")

# Warning: Error in summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) >  : 
# Expression across(all_of(x), mean) is not an aggregate expression or is not supported in Arrow; pulling data into R
# Error:
#   ! Problem while computing `..1 = across(all_of(x), mean)`.
# Caused by error in `across()`:
#   ! Can't subset columns that don't exist.
# ✖ Column `mpg` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

Clearly, arrow can compute the mean of that column before collect(), since my first code chunk does exactly that. But how do I specify column names as strings? As I said, the actual dataset is massive, so pulling the data into R first isn't an option.

Answer:

In the latest released version of arrow (9.0.0.1), across() is not yet implemented. It has been implemented in the current development version, so it should be available in the upcoming release (10.0.0).

For the moment, you can either install a nightly version of arrow via arrow::install_arrow(nightly = TRUE), which will successfully run your code example, or manually specify the columns/functions to summarise() without using across().
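As a rough sketch of the second option (specifying the column manually, without across()), one possibility is to convert the string to a symbol with rlang::sym() and inject it with !!, so that arrow receives an ordinary aggregate expression such as mean(mpg). The helper name foo_arrow_manual is made up for illustration, and the sketch assumes that x is a single column name and that the rlang package is installed alongside arrow and dplyr:

```r
library(arrow)
library(dplyr)
library(rlang)

# Hypothetical helper (not from the original post): `x` is assumed to be
# a single column name supplied as a string.
foo_arrow_manual <- function(x) {
  arrow_table(mtcars) %>%
    # sym(x) turns the string into a symbol and !! injects it, so the
    # expression arrow evaluates is mean(mpg) rather than across(...).
    # `!!x :=` names the output column after the input column.
    summarise(!!x := mean(!!sym(x))) %>%
    collect()
}

res <- foo_arrow_manual("mpg")
res
```

Because the injection happens before arrow translates the expression, the aggregation should run in arrow in the same way as the hard-coded summarise(mean(mpg)) call from the question, and the result should match the value returned by foo_regular("mpg").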
