Adding part of filename into column of dataframe - with multiple input-files


I am trying to extract some data from given ".dta" files and add a new column to each resulting dataframe containing part of the filename (a year). This is needed because I want to do some cohort analysis, so I have to track the origin of each observation via the year column and later manipulate it to see the cohort for each individual. For ".csv" files I have somewhat working code, but I expect some issues once I deal with the actual data. My code is as follows:

library(data.table) #-> for fread
library(readr)      #-> for read_csv2
library(purrr)      #-> for map_dfr
library(dplyr)      #-> for the pipe operator and mutate
library(stringr)    #-> for str_sub

### Get the desired part of the filenames (e.g. 4 digits for the year)
filenames <- list.files("~/R/Data", full.names = TRUE, pattern = "\\.csv$")
sites <- str_sub(filenames, start = -5, end = -5) # just for experimental purposes: -5 to -5
### Get the number of rows of each file
file_lengths <- unlist(lapply(lapply(filenames, read_csv2), nrow))
### Repeat sites using those lengths
file_names <- rep(sites, file_lengths)
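# For illustration (hypothetical values): with two files of 5 and 3 rows and
# sites c("1", "2"), the repeated vector lines up one entry per observation:
# rep(c("1", "2"), c(5, 3))
# [1] "1" "1" "1" "1" "1" "2" "2" "2"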
### Actual file-reading function
map_df_fread <- function(path,
                         pattern = "\\.csv$",
                         sep = ";",
                         dec = ",",
                         colClasses = NULL,
                         select = NULL) {
  list.files(path, pattern, full.names = TRUE) %>%
    map_dfr(~fread(., sep = sep, dec = dec, stringsAsFactors = FALSE,
                   header = TRUE, colClasses = colClasses,
                   select = select)) %>%
    as_tibble() %>%
    mutate(year = file_names) # uses the globally precomputed file_names
}
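
The function is then called on the same folder as in the first step (the path from above, everything else left on its defaults; df_all is just an illustrative name):

df_all <- map_df_fread("~/R/Data")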

It does what it is supposed to do, at least on small datasets. The actual data has more than 10m observations per variable. Since I want to handle ".dta" files, I guess I could substitute read_csv2() with read_dta(), but my concern is that in this step R would read the complete data, which I expect would take an extraordinary amount of time. Is there any way to include this first step in my file-reading function (which will have to do this 20 times)? I would really like to limit the amount of memory needed for all those computations.

Any help would be appreciated!

Thanks in advance

CodePudding user response:

Assigning a single value to a new column works because the value gets recycled: with df <- data.frame(a = 1:5, b = c(T, F, T, F, F)), the assignment df$nms <- "filename" fills the whole column. Combined with purrr::imap(), which iterates together with an index, we can build the column of (manipulated) file names directly instead of reading the files twice. To make this work for .dta files, simply substitute the relevant I/O functions and the pattern in list.files() or the function arguments.
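
Spelled out on its own, the recycling behaviour looks like this:

df <- data.frame(a = 1:5, b = c(T, F, T, F, F))
df$nms <- "filename" # the single string is recycled to all 5 rows
df$nms
# [1] "filename" "filename" "filename" "filename" "filename"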

# NOTRUN
write.csv(data.frame(a = 1:5, b = c(T, F, T, F, F)), file = "t1.csv")
write.csv(data.frame(a = 1:3, b = c(F, F, F)), file = "t2.csv")

f <- function(path = NULL){
  if(is.null(path)) path <- getwd()
  fls <- as.list(list.files(path = path, pattern = "\\.csv$"))
  # the part of each file name that identifies the year (here simply the 2nd character)
  string <- lapply(fls, substring, 2, 2)
  # read every file exactly once and attach the matching file-name part via the imap index
  lapply(fls, data.table::fread) |>
    purrr::imap(~.x |> transform(year = string[[.y]]))
}

f()

[[1]]
   V1 a     b year
1:  1 1  TRUE    1
2:  2 2 FALSE    1
3:  3 3  TRUE    1
4:  4 4 FALSE    1
5:  5 5 FALSE    1

[[2]]
   V1 a     b year
1:  1 1 FALSE    2
2:  2 2 FALSE    2
3:  3 3 FALSE    2

Or, if you want a single data.frame:

do.call(rbind, f())
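
For the original .dta use case, a minimal sketch of the same idea, assuming the files are read with haven::read_dta() and that the year sits in the last four characters before the extension (e.g. "panel_2015.dta"; both the file naming and the helper name f_dta are illustrative, not from the question):

f_dta <- function(path = NULL){
  if(is.null(path)) path <- getwd()
  fls <- list.files(path = path, pattern = "\\.dta$", full.names = TRUE)
  # extract the year from each file name, e.g. "panel_2015.dta" -> "2015"
  years <- stringr::str_sub(basename(fls), start = -8, end = -5)
  # each file is read only once; the year comes from the file name, not from the data
  lapply(fls, haven::read_dta) |>
    purrr::imap(~.x |> transform(year = years[.y]))
}

# single data.frame, as above
do.call(rbind, f_dta())

Because the year is derived from basename(fls), the function never has to read a file twice just to learn its number of rows.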