Home > Mobile >  Read all csv files in a directory and add the name of each file in a new column
Read all csv files in a directory and add the name of each file in a new column

Time:01-09

I have this code that reads all CSV files in a directory.

nm <- list.files()

df <- do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T)))

I want to modify it in a way that appends the filename to the data. The result would be a single data frame that has all the CSV files, and inside the data frame, there is a column that specifies from which file the data came. How to do it?

CodePudding user response:

Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:

library(readr)
library(purrr)

df <- list.files() |>
  set_names() |>
  map_dfr(read_delim, .id = "file")

df
# A tibble: 9 × 3
  file    col1  col2
  <chr>  <dbl> <dbl>
1 f1.csv     1     4
2 f1.csv     2     5
3 f1.csv     3     6
4 f2.csv     1     4
5 f2.csv     2     5
6 f2.csv     3     6
7 f3.csv     1     4
8 f3.csv     2     5
9 f3.csv     3     6

Example data:

for (f in c("f1.csv", "f2.csv", "f3.csv")) {
  readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}

CodePudding user response:

readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."

nm |> 
  readr::read_csv(
    id = "file_path"
  )

I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.

nm |> 
  readr::read_csv(
    id = "file_path"
  ) |> 
  dplyr::mutate(
    file_name_1 = basename(file_path),                     # If you want the extension
    file_name_2 = tools::file_path_sans_ext(file_name_1),  # If you don't
  ) 

CodePudding user response:

Here is another solution using purrr, which removes the file extention from the value in the column filename.

library(tidyverse)

nm <- list.files(pattern = "\\.csv$")

df <- map_dfr(
  .x = nm,
  ~ read.csv(.x) %>%
    mutate(
      filename = stringr::str_replace(
        .x,
        "\\.csv$",
        ""
      )
    )
)

View(df)

EDIT

Actually you can still removes the file extention from the column for the file names when you apply @zephryl's approach by adding a mutate() process as follows:

df <- nm %>%
  set_names() %>%
  map_dfr(read_delim, .id = "file") %>%
  mutate(
    file = stringr::str_replace(
      file,
      "\\.csv$",
      ""
    )
  )

CodePudding user response:

You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.

df <- dplyr::bind_rows(
  lapply(setNames(nm, basename(nm)), read_csv2),
  .id = 'src'
)

The use of basename() removes the directory paths prepended to the file names.

  • Related