I'd like to process several files in the same S3 bucket in the same way, so I created a list of the file paths:
dt <- seq(as.Date("1991/12/23"), by = "day", length.out = 5)
dt_ls <- paste0('s3://donuts/date=',dt)
And then I run a for loop over that list:
for (i in seq_along(dt_ls)) {
  df <- spark_read_parquet(sc, "df", path = dt_ls[i])  # read in one day's files
  df_tbl <- tbl(sc, "df")  # reference the registered table as a tbl
  # perform whatever operations you like
  rm(df)
}
However, I immediately get one of two errors when trying to assign path = dt_ls[i]:
Error in UseMethod("invoke"): no applicable method for 'invoke' applied to an object of class "character"
or:
Error in as.vector(x, "character"): cannot coerce type 'environment' to vector of type 'character'
I see the same errors when running a single line in isolation, e.g.:
tmp <- spark_read_parquet(sc, "tmp", path = dt_ls[1])
My read of these errors is that I cannot pass an S3 file path saved as an object to spark_read_parquet: because the back end of the command calls invoke, it doesn't resolve to the contents of the list index I've passed to it. Therefore, I would have to write the path directly into the path argument.
Is that a correct interpretation? Is there a work around so I can automate the opening of all these files?
Answer:
SOLUTION:
The quotation marks that appear around a chr object in a list appear to have been the problem. Removing those quotation marks when passing a list index to the path argument in spark_read_parquet allows the function to run normally.
So the solution in brief:
tmp <- spark_read_parquet(sc, "tmp", path = noquote(dt_ls[1]))
And an example of the input causing the issue:
[1] "s3://donuts/date=2021-12-23"
[2] "s3://donuts/date=2021-12-24"
[3] "s3://donuts/date=2021-12-25"
So the filepath passed must resemble:
[1] s3://donuts/date=2021-12-23
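To see what the quote-stripping step does to a path, here is a minimal base R sketch. It assumes the helper in the solution is base R's noquote(), and it rebuilds dt_ls exactly as in the question; the variable p is only illustrative. No Spark connection is needed to run it.

```r
# Build the daily paths as in the question.
dt <- seq(as.Date("1991/12/23"), by = "day", length.out = 5)
dt_ls <- paste0("s3://donuts/date=", dt)

# Default printing shows a character value wrapped in quotation marks.
print(dt_ls[1])

# noquote() (assumed to be the helper meant above) tags the value with
# class "noquote", so it prints without quotation marks.
p <- noquote(dt_ls[1])
print(p)
class(p)

# The characters themselves are unchanged; only the class and the
# printed form differ.
identical(as.character(p), dt_ls[1])
```

The same wrapping could be applied inside the loop from the question by passing path = noquote(dt_ls[i]), which would cover all five dates rather than just the first.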