I am trying to read 1000's of csv file & append them and save them as one rds file.
Issue: I am trying to add the filename as a column in each csv
so that I know what data has come from which csv file (All files have same columns) but not able to do so.
Example csv's to work with:
# Setting up some example csv files to work with
mtcars_slim <- select(mtcars, 1:3)
write_csv(slice(mtcars_slim, 1:4), "Input data/sub folder/AA_1.csv")
write_csv(slice(mtcars_slim, 5:10), "Input data/sub folder/AAA_2.csv")
write_csv(slice(mtcars_slim, 11:1), "Input data/sub folder/BBB_3.csv")
Code I have tried below:
# this code worked but it doesn't have filename within the dataset
list.files(path = "Input data/sub folder/",
pattern="*.csv",
full.names = T) %>%
map_df(~read_csv(.)) %>%
saveRDS("output_compiled_data.rds")
So I have tried to modify above code to include filename as column to each csv file in below code chunk but it didn't work.
file_names <- list.files(path = "Input data/sub folder/",
pattern="*.csv",
full.names = T) %>%
map_df(file_names, ~read_csv(.) %>%
mutate(symbol = file_names)) %>%
saveRDS("output_compiled_data.rds")
data_tbl <- read_rds("output_compiled_data.rds")
data_tbl
CodePudding user response:
One option would be to use a named list of filenames. Afterwards you could add a column with the filename via the .id
argument of map_df
:
library(dplyr)
library(purrr)
library(readr)
mtcars_slim <- select(mtcars, 1:3)
write_csv(slice(mtcars_slim, 1:4), "AA_1.csv")
write_csv(slice(mtcars_slim, 5:10), "AAA_2.csv")
write_csv(slice(mtcars_slim, 11:1), "BBB_3.csv")
fn <- list.files(
path = ".",
pattern = "\\.csv",
full.names = T
)
names(fn) <- basename(fn)
map_df(fn, ~ read_csv(., show_col_types = FALSE), .id = "file")
#> # A tibble: 21 × 4
#> file mpg cyl disp
#> <chr> <dbl> <dbl> <dbl>
#> 1 AA_1.csv 21 6 160
#> 2 AA_1.csv 21 6 160
#> 3 AA_1.csv 22.8 4 108
#> 4 AA_1.csv 21.4 6 258
#> 5 AAA_2.csv 18.7 8 360
#> 6 AAA_2.csv 18.1 6 225
#> 7 AAA_2.csv 14.3 8 360
#> 8 AAA_2.csv 24.4 4 147.
#> 9 AAA_2.csv 22.8 4 141.
#> 10 AAA_2.csv 19.2 6 168.
#> # … with 11 more rows