Iterating over CSVs to different dataframes based on file names


I have a dataframe that contains the names of a bunch of .csv files; it looks like the `Key` listing shown below.

What I'm trying to do is read each of these .csv files into a dataframe and append the results, producing three separate dataframes based on what's in the file names:

  1. Create a dataframe with all results from .csv files with -callers- in the file name
  2. Create a dataframe with all results from .csv files with -results- in the file name
  3. Create a dataframe with all results from .csv files with -script_results- in the file name

If I were reading the first .csv file in the dataframe below, the command to convert it into a dataframe would look like this:

data <- aws.s3::s3read_using(read.csv, object = "s3://abc-testtalk/08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv")

But what I'm trying to do is:

  1. Iterate ALL the .csv files under Key using the s3read_using function

  2. Put them in three separate dataframes based on the file names as listed above

    Key
    08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv
    08182020-testpilot-arizona-results-08-18-2020--08-18-2020-606698088.csv
    08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv
    08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-450823767.csv
    08182020-testpilot-iowa-callers-08-18-2020-374839084.csv
    08182020-testpilot-maine-callers-08-18-2020-396935866.csv
    08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv
    08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv
    08182020-testpilot-michigan-callers-08-18-2020-367670258.csv
    08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv
    08182020-testpilot-michigan-results-08-18-2020--08-18-2020-544974900.csv
    08182020-testpilot-michigan-script_results-08-18-2020--08-18-2020-239089219.csv
    08182020-testpilot-nevada-callers-08-18-2020-782329503.csv
    08182020-testpilot-nevada-results-08-18-2020--08-18-2020-348644934.csv
    08182020-testpilot-nevada-script_results-08-18-2020--08-18-2020-517037762.csv
    08182020-testpilot-new-hampshire-callers-08-18-2020-134150800.csv
    08182020-testpilot-north-carolina-callers-08-18-2020-739838755.csv
    08182020-testpilot-pennsylvania-callers-08-18-2020-223839956.csv
    08182020-testpilot-pennsylvania-results-08-18-2020--08-18-2020-747438886.csv
    08182020-testpilot-pennsylvania-script_results-08-18-2020--08-18-2020-546894204.csv
    08182020-testpilot-virginia-callers-08-18-2020-027531377.csv
    08182020-testpilot-virginia-follow-ups-08-18-2020--08-18-2020-419338697.csv
    08182020-testpilot-virginia-results-08-18-2020--08-18-2020-193170030.csv
    

CodePudding user response:

Create 3 empty dataframes. You will probably also want to give them column names matching the columns of the files you are appending:

results <- data.frame()
script_results <- data.frame()
callers <- data.frame()
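If the schema of the CSVs is known in advance, the empty frames can instead be created with typed, named columns so that `rbind` has something to match against. A minimal sketch (the column names here are hypothetical; substitute the real headers from your files):

```r
# Hypothetical columns -- replace with the actual headers from your CSVs
results <- data.frame(state = character(), timestamp = character(),
                      count = integer(), stringsAsFactors = FALSE)
script_results <- results[0, ]  # copy the same empty structure
callers <- results[0, ]
```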

Then iterate over file_name, reading each file into the data object. Depending on which pattern ("-results-", "-script_results-", or "-callers-") is contained in the file name, the data is appended to the corresponding dataframe:

for (file in file_name) {
  data <- aws.s3::s3read_using(read.csv, object = paste0("s3://abc-testtalk/", file))

  # grepl(pattern, x): the pattern comes first, then the string to search
  if (grepl("-results-", file)) { results <- rbind(results, data) }
  if (grepl("-script_results-", file)) { script_results <- rbind(script_results, data) }
  if (grepl("-callers-", file)) { callers <- rbind(callers, data) }
}
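One subtlety worth noting: "-results-" does not accidentally match the script_results files, because those names contain "_results-" (an underscore before results) rather than "-results-". A quick base-R check with three of the file names from the question:

```r
files <- c(
  "08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv",
  "08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv",
  "08182020-testpilot-maine-callers-08-18-2020-396935866.csv"
)
grepl("-results-", files)         # TRUE FALSE FALSE
grepl("-script_results-", files)  # FALSE TRUE FALSE
grepl("-callers-", files)         # FALSE FALSE TRUE
```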

CodePudding user response:

As an alternative to @JohnFranchak's recommendation for map_dfr (which likely works just fine), the method that I referenced in comments would look something like this:

alldat <- lapply(setNames(nm = dat$file_name),
                 function(obj) aws.s3::s3read_using(read.csv, object = obj))

callers <- do.call(rbind, alldat[grepl("-callers-", names(alldat))])
results <- do.call(rbind, alldat[grepl("-results-", names(alldat))])
script_results <- do.call(rbind, alldat[grepl("-script_results-", names(alldat))])
others <- do.call(rbind, alldat[!grepl("-(callers|results|script_results)-", names(alldat))])

The do.call(rbind, ...) part is analogous to dplyr::bind_rows and data.table::rbindlist in that it accepts a list of frames, and the result is a single frame. Some differences:

  • do.call(rbind, ...) really requires all columns to exist in all frames, in the same order. It's not hard to enforce this externally (e.g., adding missing columns, rearranging), but it's not automatic.
  • data.table::rbindlist will complain under the same conditions (missing columns or different order), but it has fill= and use.names= arguments that can be set to TRUE to handle them.
  • dplyr::bind_rows will fill and row-bind by-name by default, without message or warning. (I don't agree that a default of silence is good all of the time, but it is the simplest.)
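The first point is easy to demonstrate with plain base R (toy frames, no S3 involved):

```r
a <- data.frame(x = 1:2, y = c("p", "q"))
b <- data.frame(x = 3, y = "r")
d <- data.frame(x = 4)  # missing column y

do.call(rbind, list(a, b))       # works: 3 rows, columns x and y
try(do.call(rbind, list(a, d)))  # errors: column counts differ
# dplyr::bind_rows(a, d) would instead fill the missing y with NA
```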

Lastly, my use of setNames(nm=..) is merely to assign the filename to each object. This is not strictly necessary since we still have dat$file_name, but I've found that with two separate objects, it is easy to accidentally change (delete, append, or reorder) one of them and not the other, so I prefer to keep the names and the objects (frames) perfectly tied together. These two calls produce effectively the same named list:

lapply(setNames(nm = dat$file_name), ...)
sapply(dat$file_name, ..., simplify = FALSE)
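A tiny self-contained illustration of that equivalence, using toupper as a stand-in for the s3read_using call:

```r
nms <- c("a.csv", "b.csv")
via_lapply <- lapply(setNames(nm = nms), toupper)   # names come from setNames
via_sapply <- sapply(nms, toupper, simplify = FALSE) # names come from USE.NAMES
identical(via_lapply, via_sapply)  # TRUE
```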