How to add filename as a column to csv while reading & appending multiple csv's in r?-CodePudding

I am trying to read 1000's of csv file & append them and save them as one rds file.

Issue: I am trying to add the filename as a column in each csv so that I know what data has come from which csv file (All files have same columns) but not able to do so.

Example csv's to work with:

# Setting up some example csv files to work with

mtcars_slim <- select(mtcars, 1:3)

write_csv(slice(mtcars_slim, 1:4), "Input data/sub folder/AA_1.csv")
write_csv(slice(mtcars_slim, 5:10), "Input data/sub folder/AAA_2.csv")
write_csv(slice(mtcars_slim, 11:1), "Input data/sub folder/BBB_3.csv")

Code I have tried below:

# this code worked but it doesn't have filename within the dataset

list.files(path = "Input data/sub folder/",
              pattern="*.csv", 
              full.names = T)  %>% 
    map_df(~read_csv(.)) %>% 
  
  saveRDS("output_compiled_data.rds")

So I have tried to modify above code to include filename as column to each csv file in below code chunk but it didn't work.

file_names <- list.files(path = "Input data/sub folder/",
              pattern="*.csv", 
              full.names = T)  %>% 
    map_df(file_names, ~read_csv(.) %>% 
                    mutate(symbol = file_names)) %>% 
  
  saveRDS("output_compiled_data.rds")


data_tbl <- read_rds("output_compiled_data.rds")
data_tbl

CodePudding user response：

One option would be to use a named list of filenames. Afterwards you could add a column with the filename via the .id argument of map_df:

library(dplyr)
library(purrr)
library(readr)

mtcars_slim <- select(mtcars, 1:3)

write_csv(slice(mtcars_slim, 1:4), "AA_1.csv")
write_csv(slice(mtcars_slim, 5:10), "AAA_2.csv")
write_csv(slice(mtcars_slim, 11:1), "BBB_3.csv")

fn <- list.files(
  path = ".",
  pattern = "\\.csv",
  full.names = T
)
names(fn) <- basename(fn)

map_df(fn, ~ read_csv(., show_col_types = FALSE), .id = "file")
#> # A tibble: 21 × 4
#>    file        mpg   cyl  disp
#>    <chr>     <dbl> <dbl> <dbl>
#>  1 AA_1.csv   21       6  160 
#>  2 AA_1.csv   21       6  160 
#>  3 AA_1.csv   22.8     4  108 
#>  4 AA_1.csv   21.4     6  258 
#>  5 AAA_2.csv  18.7     8  360 
#>  6 AAA_2.csv  18.1     6  225 
#>  7 AAA_2.csv  14.3     8  360 
#>  8 AAA_2.csv  24.4     4  147.
#>  9 AAA_2.csv  22.8     4  141.
#> 10 AAA_2.csv  19.2     6  168.
#> # … with 11 more rows