R/arrow summarizing on variable columns

Time:11-05

I have a large-ish Parquet file that I'm referencing via arrow::open_dataset. I'd like to get the max value of one or more of its columns, without knowing in advance which columns (or how many). In general, this sounds like "programming with dplyr" (assuming arrow 10 and its recent support for dplyr::across), but I can't get it to work.

library(arrow)
library(dplyr)

write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a")
open_dataset("quux.parquet") %>%
  summarize(across(sym(vars), ~ max(.))) %>%
  collect()
# # A tibble: 1 x 1
#       a
#   <dbl>
# 1     9

When vars has length 2 or more, I assume I need syms or something similar, but that fails:

vars <- c("a", "b")
open_dataset("quux.parquet") %>%
  summarize(across(all_of(syms(vars)), ~ max(.))) %>%
  collect()
# Error: Must subset columns with a valid subscript vector.
# x Subscript has the wrong type `list`.
# i It must be numeric or character.

How do I lazily (not load all data) find the max of multiple columns in an arrow dataset?

I suspect that the correct answer in dplyr is some form of syms; whether arrow supports it is the next question. I'm not tied to the dplyr mechanisms: if there's a method using ds$NewScan() or similar, I'm amenable.

Answer:

Is this the kind of thing you're after: using tidyselect's all_of() function? It takes the character vector of column names directly, so no sym()/syms() conversion is needed.

library(arrow)
library(dplyr)

write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")

vars <- c("a", "d")

open_dataset("quux.parquet") %>%
  summarize(across(all_of(vars), ~ max(.))) %>%
  collect()
#> # A tibble: 1 × 2
#>       a d    
#>   <dbl> <chr>
#> 1     9 r
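
To see why the syms() attempt in the question trips the "wrong type `list`" error, it may help to compare what each call hands to all_of(). The sketch below uses a plain in-memory data frame rather than an arrow dataset (an assumption made only to keep it small); the selection rules come from tidyselect, so the same reasoning should carry over.

```r
library(dplyr)

vars <- c("a", "b")

# syms() converts the character vector into a *list* of symbols,
# which all_of() rejects: it wants a character or numeric vector.
is.list(rlang::syms(vars))
#> [1] TRUE

# The untouched character vector is already the right input for all_of().
df <- data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r"))
res <- df %>%
  summarize(across(all_of(vars), ~ max(.)))
res
#>   a  b
#> 1 9 10
```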

See https://tidyselect.r-lib.org/reference/index.html for the different tidyselect functions you may also want to check out.
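
Since the column names are only known at run time, the pattern can be wrapped in a small function. This is a minimal sketch, not part of arrow or dplyr: the name max_of_columns and its arguments are hypothetical, and it writes the sample data to a temporary file instead of quux.parquet.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not an arrow or dplyr function): returns the max of
# each column named in `vars`, with the query built on the dataset before
# collect() is called.
max_of_columns <- function(path, vars) {
  open_dataset(path) %>%
    summarize(across(all_of(vars), ~ max(.))) %>%
    collect()
}

# Same sample data as above, written to a temporary file.
tmp <- tempfile(fileext = ".parquet")
write_parquet(data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r")), tmp)

res <- max_of_columns(tmp, c("a", "b"))
```

Here res should hold a single row with a = 9 and b = 10. Because all_of() is strict, a name in vars that is not a column of the dataset is expected to raise an error rather than be silently dropped.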
