I am working with many csv files whose names carry the month and year in brackets. For example:
files_names <- list.files("data/", recursive = TRUE, full.names = TRUE)
[1] "data/BOC_All_ATMImage_(Aug 2020).txt" "data/BOC_All_ATMImage_(Aug 2021).txt"
[3] "data/BOC_All_ATMImage_(Feb 2021).txt" "data/BOC_All_ATMImage_(Feb_2020).txt"
[5] "data/BOC_All_ATMImage_(May 2021).txt" "data/BOC_All_ATMImage_(Nov 2019).txt"
column_names <- files_names %>%
str_extract(., "(?<=\\().*?(?=\\))") %>%
str_to_lower() %>%
str_replace(., " ", "_")
"aug_2020" "aug_2021" "feb_2021" "feb_2020" "may_2021" "nov_2019"
I am using the map2 function in purrr to read the csv files and set a column name in each one, looping over files_names and column_names.
data <-
map2(files_names, column_names,
~ read_csv(.x, guess_max = 50000) %>%
mutate(
day = 01,
month_year = str_extract(.x, "(?<=\\().*?(?=\\))"),
date_dmy = paste0(day, "-", month_year),
date = dmy(date_dmy),
"{.y}" := 1
),
.id = "group"
)
I need to figure out how to arrange this list so the data sets are in chronological order. One approach is to sort the initial character vectors (files_names and column_names) before feeding them into the loop. Or perhaps it would be easier to simply rearrange the data list so the data frames are chronologically ordered? I have created a date variable in each data frame, so this could be another approach, but I'm not sure how to reorder the list by a date variable.
CodePudding user response:
We can use str_match to capture the month and the year. After that, use some dplyr to clean the data. To arrange the months I thought of using a factor.
library(tidyverse)
files_names <-
c(
"data/BOC_All_ATMImage_(Aug 2020).txt", "data/BOC_All_ATMImage_(Aug 2021).txt",
"data/BOC_All_ATMImage_(Feb 2021).txt", "data/BOC_All_ATMImage_(Feb_2020).txt",
"data/BOC_All_ATMImage_(May 2021).txt", "data/BOC_All_ATMImage_(Nov 2019).txt"
)
# Factor constructor whose levels are the month abbreviations in calendar order
months <- partial(factor, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
files_names %>%
str_match(".*_\\((.*)[ _](\\d+)\\)\\.txt$") %>%
as.data.frame() %>%
mutate(V2 = months(V2)) %>%
arrange(V3, V2) %>%
transmute(files_names = V1, column_names = str_to_lower(str_c(V2, '_', V3)))
#>                            files_names column_names
#> 1 data/BOC_All_ATMImage_(Nov 2019).txt     nov_2019
#> 2 data/BOC_All_ATMImage_(Feb_2020).txt     feb_2020
#> 3 data/BOC_All_ATMImage_(Aug 2020).txt     aug_2020
#> 4 data/BOC_All_ATMImage_(Feb 2021).txt     feb_2021
#> 5 data/BOC_All_ATMImage_(May 2021).txt     may_2021
#> 6 data/BOC_All_ATMImage_(Aug 2021).txt     aug_2021
Created on 2021-12-20 by the reprex package (v2.0.1)
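As a variation on the factor idea, here is a minimal base R sketch that gets the same ordering from the built-in month.abb constant instead of a hand-written level vector. The helper names label, month, year, ord and sorted_files are made up for illustration, and it assumes every label starts with a three-letter English month abbreviation and ends with a four-digit year, as in the file names above.

```r
files_names <- c(
  "data/BOC_All_ATMImage_(Aug 2020).txt", "data/BOC_All_ATMImage_(Aug 2021).txt",
  "data/BOC_All_ATMImage_(Feb 2021).txt", "data/BOC_All_ATMImage_(Feb_2020).txt",
  "data/BOC_All_ATMImage_(May 2021).txt", "data/BOC_All_ATMImage_(Nov 2019).txt"
)

# Pull out the bracketed label, e.g. "Aug 2020" or "Feb_2020"
label <- sub(".*\\((.*)\\).*", "\\1", files_names)

# Assumption: first three characters are the month, last four are the year
month <- substr(label, 1, 3)
year  <- as.integer(substr(label, nchar(label) - 3, nchar(label)))

# month.abb is base R's c("Jan", ..., "Dec"), so match() gives the month number
ord <- order(year, match(month, month.abb))

sorted_files <- files_names[ord]
sorted_files
```

Because month.abb is a fixed English constant, this does not depend on the session locale. The same ord index can be applied to column_names so both vectors stay aligned.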
CodePudding user response:
I think the following solution could also help you sort your dates before starting to read them into R:
library(dplyr)
library(stringr)
library(tibble)  # for enframe()
files_names %>%
enframe() %>%
mutate(date = str_extract(value, "(?<=\\().*(?=\\))"),
date = paste(str_extract(date, "\\d+"), str_extract(date, "[[:alpha:]]+"), "01",
sep = "-"),
date = as.Date(date, format = "%Y-%b-%d")) %>%
arrange(desc(date))  # newest first; use arrange(date) for oldest first
# A tibble: 6 x 3
   name value                                date
  <int> <chr>                                <date>
1     2 data/BOC_All_ATMImage_(Aug 2021).txt 2021-08-01
2     5 data/BOC_All_ATMImage_(May 2021).txt 2021-05-01
3     3 data/BOC_All_ATMImage_(Feb 2021).txt 2021-02-01
4     1 data/BOC_All_ATMImage_(Aug 2020).txt 2020-08-01
5     4 data/BOC_All_ATMImage_(Feb_2020).txt 2020-02-01
6     6 data/BOC_All_ATMImage_(Nov 2019).txt 2019-11-01
A small note on the regex you used: since each file name has only one pair of brackets, I don't think you need to make the .* part lazy.
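To illustrate the lazy versus greedy point, here is a small base R sketch comparing the two patterns. The extract() helper and the two_pairs file name are hypothetical, added only to show where the patterns would start to differ.

```r
# Hypothetical helper: return the first match of a Perl-style pattern
extract <- function(x, pattern) regmatches(x, regexpr(pattern, x, perl = TRUE))

lazy   <- "(?<=\\().*?(?=\\))"
greedy <- "(?<=\\().*(?=\\))"

one_pair  <- "data/BOC_All_ATMImage_(Aug 2020).txt"
two_pairs <- "data/BOC_(v2)_ATMImage_(Aug 2020).txt"  # made-up name with two bracket pairs

extract(one_pair, lazy)     # "Aug 2020"
extract(one_pair, greedy)   # "Aug 2020"

extract(two_pairs, lazy)    # "v2"
extract(two_pairs, greedy)  # "v2)_ATMImage_(Aug 2020"
```

With a single pair of brackets both patterns return the same text, so the lazy quantifier is harmless but unnecessary here. It would only matter if a file name could contain more than one bracketed part.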
CodePudding user response:
By parsing the dates in column_names and ordering on them, you can arrange your files_names in chronological order and process your files from there:
files_names <- list.files("data/", recursive = TRUE, full.names = TRUE)
column_names <- files_names %>%
str_extract(., "(?<=\\().*?(?=\\))") %>%
str_to_lower() %>%
str_replace(., " ", "_")
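# Compute the chronological order once, then apply it to both vectors
# so files_names and column_names stay aligned for map2()
ord <- order(readr::parse_date(column_names, "%b_%Y"))
files_names <- files_names[ord]
column_names <- column_names[ord]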
files_names
[1] "data/BOC_All_ATMImage_(Nov 2019).txt"
[2] "data/BOC_All_ATMImage_(Feb_2020).txt"
[3] "data/BOC_All_ATMImage_(Aug 2020).txt"
[4] "data/BOC_All_ATMImage_(Feb 2021).txt"
[5] "data/BOC_All_ATMImage_(May 2021).txt"
[6] "data/BOC_All_ATMImage_(Aug 2021).txt"
CodePudding user response:
I can't really run your code without the csv files, but it looks like you've already got a list of tibbles, and you've already added a date column using the fragment from the file name. In this case, you just need
data %>% bind_rows() %>% arrange(date)
to get a single tibble, but with the rows ordered based on the date from the filename.
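If you would rather keep a list of data frames than bind them, one possible sketch is to order the list by the first date in each element. The toy data list below is invented to stand in for the map2() result, and first_dates and data_sorted are illustrative names; it assumes every data frame has a date column holding a single repeated date, as in the question.

```r
# Toy stand-ins for the data frames produced by map2(); contents are made up
data <- list(
  data.frame(date = as.Date("2020-08-01"), x = 1),
  data.frame(date = as.Date("2019-11-01"), x = 2),
  data.frame(date = as.Date("2021-02-01"), x = 3)
)

# Each frame carries one repeated date, so its first value identifies the frame;
# as.numeric() turns the Date into days since 1970-01-01 for ordering
first_dates <- vapply(data, function(df) as.numeric(df$date[1]), numeric(1))

# Reorder the list itself, oldest first
data_sorted <- data[order(first_dates)]
```

Indexing the list with order() keeps each data frame intact and only changes the position of the elements.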