I read data files from a directory where I know neither the number nor the names of the files. Each file holds a data frame (stored as a Parquet file). I can read the files, but how should I name the results?
I would like something like a named list in which each element's name is the filename. I don't know how to do this in R. In Python I would use a dictionary, like this:
import pandas as pd

file_names = ['A.parquet', 'B.parquet']
all_data = {}
for fn in file_names:
    data = pd.read_parquet(fn)
    all_data[fn] = data
How can I solve this in R?
library("arrow")
file_names = c('a.parquet', 'B.parquet')
# "named vector"?
daten = c()
for (pf in file_names) {
  # name of data frame (filename without suffix)
  df_name <- strsplit(pf, ".", fixed=TRUE)[[1]][1]
  df <- arrow::read_parquet(pf)
  daten[df_name] = df
}
This doesn't work; I get this error:
number of items to replace is not a multiple of replacement length
CodePudding user response:
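Each arrow::read_parquet() call returns a data frame, so the results of your loop should be stored in a list of data frames. In particular, you are looking for a named list. With daten[df_name] = df, R treats the data frame as a list of its columns and tries to fit all of them into a single slot, which is what triggers the message.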
file_names <- c('a.parquet', 'B.parquet')
## loop through files (can be replaced by a one-line `lapply` call)
daten <- list()  ## not c()
for (i in seq_along(file_names)) {
  daten[[i]] <- arrow::read_parquet(file_names[i])
}
## grab filename without suffix (escape the dot and anchor at the end)
names(daten) <- sub("\\.parquet$", "", file_names)
To access a list element by name, use daten[["a"]] and daten[["B"]].
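As a minimal sketch of why a list works where c() did not, the following uses the built-in mtcars and iris data frames as stand-ins for the Parquet data (an assumption made so the example runs without any files):

```r
# Built-in data frames stand in for the Parquet data (assumption for illustration).
df_a <- head(mtcars, 3)
df_b <- head(iris, 3)

# A list can hold whole data frames; [[<- stores one object per slot.
daten <- list()
daten[["a"]] <- df_a
daten[["B"]] <- df_b

names(daten)                 # element names
is.data.frame(daten[["a"]])  # each element is still a complete data frame

# Single-bracket assignment does the same only if the value is wrapped in list().
daten["c"] <- list(df_a)
```

Double brackets address exactly one element, which is why they are the usual choice when filling a list in a loop.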
Remark: Since the length of the list is known, it is better to initialize it with a fixed length, so that the list does not grow in size during the loop.
daten <- vector("list", length(file_names))
In addition, if you know the lapply function, you can replace the loop with the following, so you don't even need to initialize the list.
daten <- lapply(file_names, arrow::read_parquet)
As a result, the code can be shortened to:
daten <- lapply(file_names, arrow::read_parquet)
names(daten) <- sub("\\.parquet$", "", file_names)
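Since the question mentions that the number and names of the files are not known in advance, the same idea can be combined with list.files(). The sketch below is only an illustration: read_named is a hypothetical helper, and the demo uses .rds files with readRDS as a stand-in so that it runs without arrow installed.

```r
# Hypothetical helper: read every matching file in `dir` into a named list.
read_named <- function(dir, pattern, reader) {
  paths <- list.files(dir, pattern = pattern, full.names = TRUE)
  daten <- lapply(paths, reader)
  # Name each element after its file, without directory or extension.
  names(daten) <- tools::file_path_sans_ext(basename(paths))
  daten
}

# Demo with .rds files as a stand-in (assumption), written to a temp directory.
demo_dir <- tempfile("daten_")
dir.create(demo_dir)
saveRDS(head(mtcars, 3), file.path(demo_dir, "a.rds"))
saveRDS(head(iris, 3), file.path(demo_dir, "B.rds"))

daten <- read_named(demo_dir, "\\.rds$", readRDS)
names(daten)
```

For Parquet files the call would presumably look like read_named(some_dir, "\\.parquet$", arrow::read_parquet), where some_dir is whatever directory holds the data.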
CodePudding user response:
You can use named lists like so.
You can either use the names directly
sapply(file_names, arrow::read_parquet, USE.NAMES = TRUE, simplify = FALSE)
or set them afterwards with whatever function you want to apply
setNames(lapply(file_names, arrow::read_parquet), stringr::str_extract(file_names, '^.+(?=\\.)'))
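To see how the two variants differ in the names they produce, here is a small sketch. fake_read is a hypothetical stand-in for arrow::read_parquet so that the example runs without any files, and tools::file_path_sans_ext is used in place of the regular expression.

```r
file_names <- c("a.parquet", "B.parquet")

# Stand-in reader (assumption): returns a tiny data frame instead of reading a file.
fake_read <- function(path) data.frame(source = path)

# sapply() with simplify = FALSE names the elements by the full input strings.
by_file <- sapply(file_names, fake_read, USE.NAMES = TRUE, simplify = FALSE)
names(by_file)

# setNames() lets you choose the names, here the filenames without extension.
by_stem <- setNames(lapply(file_names, fake_read),
                    tools::file_path_sans_ext(file_names))
names(by_stem)
```

With the first variant you would access an element as by_file[["a.parquet"]], with the second as by_stem[["a"]].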
CodePudding user response:
In the tidyverse you would use purrr. This is basically the same as the lapply() or sapply() approach, but in a different ecosystem.
library(arrow)
library(purrr)
file_names = c('a.parquet', 'B.parquet')
# name each path by its filename without the extension, then read each file
daten <- file_names %>%
  set_names(tools::file_path_sans_ext(file_names)) %>%
  map(read_parquet)
You would access each list item through the usual ways.
daten$a
daten$B
# or
daten[["a"]]
daten[["B"]]