I have multiple files (~100) whose only identifier is their filename. I need to incorporate this into my data frame while also excluding the first 5 rows of useless data. For example, I have:
BC_2017_10.csv
SN : 0123456 | |
---|---|
Site : A | |
Eo25: -4.117101e-01 | |
QF follow QARTOD | |
Data interpolated | |
Date : ddmmyyy | Time :hh:mm:ss |
27102017 | 01:44:00 |
27102017 | 01:54:00 |
27102017 | 02:04:00 |
27102017 | 02:14:00 |
I would like:
File | Date: ddmmyyy | Time: hh:mm:ss |
---|---|---|
BC_2017_10 | 27102017 | 01:44:00 |
BC_2017_10 | 27102017 | 01:54:00 |
BC_2017_10 | 27102017 | 02:04:00 |
BC_2017_10 | 27102017 | 02:14:00 |
I know the answer must lie with list.files, skip, and probably rbindlist but I can't figure it out.
CodePudding user response:
You can use read_csv()
from the readr package and set the required options, e.g.
library(readr)
df <- read_csv(file = "BC_2017_10.csv", skip = 5, col_names = TRUE, id = "File")
#> Rows: 4 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): Date : ddmmyyy
#> time (1): Time :hh:mm:ss
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df
#> # A tibble: 4 × 3
#> File `Date : ddmmyyy` `Time :hh:mm:ss`
#> <chr> <dbl> <time>
#> 1 BC_2017_10.csv 27102017 01:44
#> 2 BC_2017_10.csv 27102017 01:54
#> 3 BC_2017_10.csv 27102017 02:04
#> 4 BC_2017_10.csv 27102017 02:14
Created on 2023-01-18 with reprex v2.0.2
Using the code above reads the date column in as a numeric value i.e. 27,102,017 (~27 million). To properly format your "date" column as a date (ddmmyyyy) for plotting/stats/etc, here is one option using dmy()
from the lubridate package:
library(tidyverse)
library(lubridate)
#> Loading required package: timechange
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- read_csv(file = "~/Desktop/BC_2017_10.csv", skip = 5, col_names = TRUE, id = "File")
#> Rows: 4 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): Date : ddmmyyy
#> time (1): Time :hh:mm:ss
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_formatted <- df %>%
mutate(`Date : ddmmyyy` = dmy(`Date : ddmmyyy`))
df_formatted
#> # A tibble: 4 × 3
#> File `Date : ddmmyyy` `Time :hh:mm:ss`
#> <chr> <date> <time>
#> 1 BC_2017_10.csv 2017-10-27 01:44
#> 2 BC_2017_10.csv 2017-10-27 01:54
#> 3 BC_2017_10.csv 2017-10-27 02:04
#> 4 BC_2017_10.csv 2017-10-27 02:14
Created on 2023-01-18 with reprex v2.0.2
To read in multiple files with the format "BC-something-.csv" you could use:
library(tidyverse)
library(lubridate)
library(fs)
library(data.table)
files <- dir_ls(path = "~/Desktop", regexp = "BC.*\\.csv")
data <- map(files, ~read_csv(file = .x, skip = 5, col_names = TRUE, id = "File"))
df <- rbindlist(data, fill = TRUE)
df_formatted <- df %>%
mutate(`Date : ddmmyyy` = dmy(`Date : ddmmyyy`))
df_formatted
#> File Date : ddmmyyy Time :hh:mm:ss
#> 1: BC_2017_10.csv 2017-10-27 01:44:00
#> 2: BC_2017_10.csv 2017-10-27 01:54:00
#> 3: BC_2017_10.csv 2017-10-27 02:04:00
#> 4: BC_2017_10.csv 2017-10-27 02:14:00
#> 5: BC_2017_11.csv 2018-10-27 01:44:00
#> 6: BC_2017_11.csv 2018-10-27 01:54:00
#> 7: BC_2017_11.csv 2018-10-27 02:04:00
#> 8: BC_2017_11.csv 2018-10-27 02:14:00
Created on 2023-01-18 with reprex v2.0.2