I have several (8) large files (1M rows each) with the same variables/format saved individually by year.
I would like to save them to a single table in the duckdb database format in R. The duckdb_read_csv() function does this nicely.
The problem: there is no variable indicating "year" with this method, so the trend across repeated measurements is lost. My DBI and SQL knowledge is limited, so there's probably an easy way to do this, but I'm stuck (a simple example demonstrating the issue is below):
library(duckdb)
con <- dbConnect(duckdb())
# example year 1 data
data <- data.frame(id = 1:3, b = letters[1:3]) # year = 1
path <- tempfile(fileext = ".csv")
write.csv(data, path, row.names = FALSE)
# example year 2 data
data2 <- data.frame(id = 1:3, b = letters[4:6]) # year = 2
path2 <- tempfile(fileext = ".csv")
write.csv(data2, path2, row.names = FALSE)
duckdb_read_csv(con, "data", files = c(path, path2)) # data is appended
which yields the output below; the data are appended by row, but I need a variable indicating "year":
dbReadTable(con, "data")
id b
1 1 a
2 2 b
3 3 c
4 1 d
5 2 e
6 3 f
Is there a way to create a new variable during this process, or is it better to create a table for each year and then combine them?
CodePudding user response:
One approach could be to use purrr::map_dfr with readr::read_csv for the reading, which allows you to assign an "id" column based on the names assigned to the file paths, and then register the result as a duckdb table:
library(dplyr)
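# .id = "year" fills a new column with the names given to the file paths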
purrr::map_dfr(c(year01 = path,
                 year02 = path2),
               ~readr::read_csv(.x),
               .id = "year") %>%
  duckdb_register(con, "data", .)
Result
tbl(con, "data") %>%
collect()
# A tibble: 6 × 3
year id b
<chr> <dbl> <chr>
1 year01 1 a
2 year01 2 b
3 year01 3 c
4 year02 1 d
5 year02 2 e
6 year02 3 f
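One caveat: duckdb_register() only registers the data frame as a virtual table, so the rows stay in R's memory rather than being written into the duckdb database file. If the goal is to persist everything as a real table in the duckdb format, appending each file with dbWriteTable() is one option; a minimal sketch, assuming the table name "data_all" and the year labels:
files <- c(`2001` = path, `2002` = path2)  # assumed year labels
for (yr in names(files)) {
  df <- read.csv(files[[yr]])
  df$year <- as.integer(yr)                # add the year column in R
  dbWriteTable(con, "data_all", df, append = TRUE)
}
dbReadTable(con, "data_all")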
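Alternatively, recent DuckDB versions can tag each row with its source file directly in SQL through the filename option of read_csv_auto(); the year would then be derived from that column, and the table name "data_sql" here is an assumption:
# filename = true adds a "filename" column holding each row's source CSV path
dbExecute(con, sprintf(
  "CREATE TABLE data_sql AS SELECT * FROM read_csv_auto(['%s', '%s'], filename = true)",
  path, path2
))
dbGetQuery(con, "SELECT * FROM data_sql")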