How would I read multiple files while including a column with the value of the individual files&#039-CodePudding

I have multiple files (~100) whose only identifier is their filename. I need to incorporate this into my data frame while also excluding the first 5 rows of useless data. For example, I have:

BC_2017_10.csv

SN : 0123456
Site : A
Eo25: -4.117101e-01
QF follow QARTOD
Data interpolated
Date : ddmmyyy	Time :hh:mm:ss
27102017	01:44:00
27102017	01:54:00
27102017	02:04:00
27102017	02:14:00

I would like:

File	Date: ddmmyyy	Time: hh:mm:ss
BC_2017_10	27102017	01:44:00
BC_2017_10	27102017	01:54:00
BC_2017_10	27102017	02:04:00
BC_2017_10	27102017	02:14:00

I know the answer must lie with list.files, skip, and probably rbindlist but I can't figure it out.

CodePudding user response：

You can use read_csv() from the readr package and set the required options, e.g.

library(readr)

df <- read_csv(file = "BC_2017_10.csv", skip = 5, col_names = TRUE, id = "File")
#> Rows: 4 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl  (1): Date : ddmmyyy
#> time (1): Time :hh:mm:ss
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df
#> # A tibble: 4 × 3
#>   File          `Date : ddmmyyy` `Time :hh:mm:ss`
#>   <chr>                     <dbl> <time>          
#> 1 BC_2017_10.csv         27102017 01:44           
#> 2 BC_2017_10.csv         27102017 01:54           
#> 3 BC_2017_10.csv         27102017 02:04           
#> 4 BC_2017_10.csv         27102017 02:14

^{Created on 2023-01-18 with reprex v2.0.2}

Using the code above reads the date column in as a numeric value i.e. 27,102,017 (~27 million). To properly format your "date" column as a date (ddmmyyyy) for plotting/stats/etc, here is one option using dmy() from the lubridate package:

library(tidyverse)
library(lubridate)
#> Loading required package: timechange
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

df <- read_csv(file = "~/Desktop/BC_2017_10.csv", skip = 5, col_names = TRUE, id = "File")
#> Rows: 4 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl  (1): Date : ddmmyyy
#> time (1): Time :hh:mm:ss
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df_formatted <- df %>%
  mutate(`Date : ddmmyyy` = dmy(`Date : ddmmyyy`))

df_formatted
#> # A tibble: 4 × 3
#>   File           `Date : ddmmyyy` `Time :hh:mm:ss`
#>   <chr>          <date>           <time>          
#> 1 BC_2017_10.csv 2017-10-27       01:44           
#> 2 BC_2017_10.csv 2017-10-27       01:54           
#> 3 BC_2017_10.csv 2017-10-27       02:04           
#> 4 BC_2017_10.csv 2017-10-27       02:14

^{Created on 2023-01-18 with reprex v2.0.2}

To read in multiple files with the format "BC-something-.csv" you could use:

library(tidyverse)
library(lubridate)
library(fs)
library(data.table)

files <- dir_ls(path = "~/Desktop", regexp = "BC.*\\.csv")
data <- map(files, ~read_csv(file = .x, skip = 5, col_names = TRUE, id = "File"))

df <- rbindlist(data, fill = TRUE)


df_formatted <- df %>%
  mutate(`Date : ddmmyyy` = dmy(`Date : ddmmyyy`))

df_formatted
#>              File Date : ddmmyyy Time :hh:mm:ss
#> 1: BC_2017_10.csv     2017-10-27       01:44:00
#> 2: BC_2017_10.csv     2017-10-27       01:54:00
#> 3: BC_2017_10.csv     2017-10-27       02:04:00
#> 4: BC_2017_10.csv     2017-10-27       02:14:00
#> 5: BC_2017_11.csv     2018-10-27       01:44:00
#> 6: BC_2017_11.csv     2018-10-27       01:54:00
#> 7: BC_2017_11.csv     2018-10-27       02:04:00
#> 8: BC_2017_11.csv     2018-10-27       02:14:00

^{Created on 2023-01-18 with reprex v2.0.2}