I have several (8) large files (1M rows each) with the same variables/format saved individually by year.
I would like to save them to a single table in the duckdb database format in R. The duckdb_read_csv() function does this nicely.
The problem: there is no variable indicating "year" with this method, so the trend across repeated measurements is lost. My DBI and SQL knowledge is limited, so there's probably an easy way to do this, but I'm stuck (a simple example demonstrating the issue is below):
library(duckdb)
con <- dbConnect(duckdb())
# example year 1 data
data <- data.frame(id = 1:3, b = letters[1:3]) # year = 1
path <- tempfile(fileext = ".csv")
write.csv(data, path, row.names = FALSE)
# example year 2 data
data2 <- data.frame(id = 1:3, b = letters[4:6]) # year = 2
path2 <- tempfile(fileext = ".csv")
write.csv(data2, path2, row.names = FALSE)
duckdb_read_csv(con, "data", files = c(path, path2)) # data is appended
which yields the output below; the data are appended by row, but I need a variable indicating "year":
dbReadTable(con, "data")
id b
1 1 a
2 2 b
3 3 c
4 1 d
5 2 e
6 3 f
Is there a way to create a new variable during this process, or is it better to create a table for each year and then combine them?
CodePudding user response:
One approach could be to use purrr::map_dfr with readr::read_csv for the reading, which allows you to assign an "id" column based on the names assigned to the file paths, and then register the result as a duckdb table:
library(dplyr)
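# .id = "year" fills a new column with the names given to the file paths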
purrr::map_dfr(c(year01 = path,
                 year02 = path2),
               ~readr::read_csv(.x),
               .id = "year") %>%
  duckdb_register(con, "data", .)
Result
tbl(con, "data") %>%
collect()
# A tibble: 6 × 3
year id b
<chr> <dbl> <chr>
1 year01 1 a
2 year01 2 b
3 year01 3 c
4 year02 1 d
5 year02 2 e
6 year02 3 f
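One caveat: duckdb_register() only registers the data frame as a virtual table, so the rows stay in R's memory rather than being written into the duckdb database file. If the goal is to persist everything as a real table in the duckdb format, appending each file with dbWriteTable() is one option; a minimal sketch, assuming the table name "data_all" and the year labels:
files <- c(`2001` = path, `2002` = path2)  # assumed year labels
for (yr in names(files)) {
  df <- read.csv(files[[yr]])
  df$year <- as.integer(yr)                # add the year column in R
  dbWriteTable(con, "data_all", df, append = TRUE)
}
dbReadTable(con, "data_all")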
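Alternatively, recent DuckDB versions can tag each row with its source file directly in SQL through the filename option of read_csv_auto(); the year would then be derived from that column, and the table name "data_sql" here is an assumption:
# filename = true adds a "filename" column holding each row's source CSV path
dbExecute(con, sprintf(
  "CREATE TABLE data_sql AS SELECT * FROM read_csv_auto(['%s', '%s'], filename = true)",
  path, path2
))
dbGetQuery(con, "SELECT * FROM data_sql")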