I have a large-ish parquet file I'm referencing via arrow::open_dataset. I'd like to get the max value of one or more of the columns, where I don't know a priori which (or how many) columns. In general, this sounds like "programming with dplyr" (assuming arrow 10 and its recent support of dplyr::across), but I can't get it to work.
library(arrow)
library(dplyr)
write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a")
open_dataset("quux.parquet") %>%
summarize(across(sym(vars), ~ max(.))) %>%
collect()
# # A tibble: 1 x 1
#       a
#   <dbl>
# 1     9
But when vars has length 2 or more, I assume I need to use syms or similar, and that fails with:
vars <- c("a", "b")
open_dataset("quux.parquet") %>%
summarize(across(all_of(syms(vars)), ~ max(.))) %>%
collect()
# Error: Must subset columns with a valid subscript vector.
# x Subscript has the wrong type `list`.
# i It must be numeric or character.
How do I lazily (not load all data) find the max of multiple columns in an arrow dataset?
I suspect that the correct answer in dplyr will be some form of syms, in which case the next question is whether arrow supports it. I'm not tied to the dplyr mechanisms: if there's a method using ds$NewScan() or similar, I'm amenable.
Answer:
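Is this the kind of thing you're after - using tidyselect's all_of function? It takes the character vector of column names directly, so no conversion with sym or syms is needed. syms(vars) returns a list of symbols, which is why all_of rejected it with the "wrong type `list`" error.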
library(arrow)
library(dplyr)
write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a", "d")
open_dataset("quux.parquet") %>%
summarize(across(all_of(vars), ~ max(.))) %>%
collect()
#> # A tibble: 1 × 2
#>       a d
#>   <dbl> <chr>
#> 1     9 r
See https://tidyselect.r-lib.org/reference/index.html for the different tidyselect functions you may also want to check out.
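Since the set of columns isn't known ahead of time, the same pattern can be wrapped in a small function. This is only a minimal sketch: max_of_columns is a hypothetical helper name (not part of arrow or dplyr), it writes the example data to a temporary file rather than quux.parquet, and it assumes an arrow version with across support (arrow 10 or later, as in the question).

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow or dplyr): compute the max of each
# column named in `vars`, a character vector of any length. The summarize()
# is only evaluated when collect() is called.
max_of_columns <- function(path, vars) {
  open_dataset(path) %>%
    summarize(across(all_of(vars), ~ max(.))) %>%
    collect()
}

# Same example data as above, written to a temporary file.
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r")), path)

res1 <- max_of_columns(path, "a")          # one column
res2 <- max_of_columns(path, c("a", "b"))  # two columns
print(res2)
```

Because all_of errors when a requested name is missing from the dataset, a typo in vars fails loudly instead of silently dropping a column.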