I have a dataframe that contains the names of a bunch of .CSV files. It looks how it does in the snippet below:
What I'm trying to do is convert each of these .CSVs into a dataframe that appends the results of each. What I'm trying to do is create three different dataframes based on what's in the file names:
- Create a dataframe with all results from .CSV files with
-callers-
in its file name - Create a dataframe with all results from .CSV files with
-results
in its filename - Create a dataframe with all results from .CSV files with
-script_results-
in its filename
The command to actually convert the .CSV file into a dataframe looks like this if I were using the first .CSV in the dataframe below:
data <- aws.s3::s3read_using(read.csv, object = "s3://abc-testtalk/08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv")
But what I'm trying to do is:
Iterate ALL the .csv files under
Key
using thes3read_using
functionPut them in three separate dataframes based on the file names as listed above
Key 08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv 08182020-testpilot-arizona-results-08-18-2020--08-18-2020-606698088.csv 08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv 08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-450823767.csv 08182020-testpilot-iowa-callers-08-18-2020-374839084.csv 08182020-testpilot-maine-callers-08-18-2020-396935866.csv 08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv 08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv 08182020-testpilot-michigan-callers-08-18-2020-367670258.csv 08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv 08182020-testpilot-michigan-results-08-18-2020--08-18-2020-544974900.csv 08182020-testpilot-michigan-script_results-08-18-2020--08-18-2020-239089219.csv 08182020-testpilot-nevada-callers-08-18-2020-782329503.csv 08182020-testpilot-nevada-results-08-18-2020--08-18-2020-348644934.csv 08182020-testpilot-nevada-script_results-08-18-2020--08-18-2020-517037762.csv 08182020-testpilot-new-hampshire-callers-08-18-2020-134150800.csv 08182020-testpilot-north-carolina-callers-08-18-2020-739838755.csv 08182020-testpilot-pennsylvania-callers-08-18-2020-223839956.csv 08182020-testpilot-pennsylvania-results-08-18-2020--08-18-2020-747438886.csv 08182020-testpilot-pennsylvania-script_results-08-18-2020--08-18-2020-546894204.csv 08182020-testpilot-virginia-callers-08-18-2020-027531377.csv 08182020-testpilot-virginia-follow-ups-08-18-2020--08-18-2020-419338697.csv 08182020-testpilot-virginia-results-08-18-2020--08-18-2020-193170030.csv
CodePudding user response:
Create 3 empty dataframes. You will probably also need to indicate column names matching column names from each of the file you want to append:
results <- data.frame()
script_results <- data.frame()
callers <- data.frame()
Then iterate over file_name
and read it into data
object. Conditionally on what pattern ("-results-", "-script_results-" or "-caller-" is contanied in the name of each file, it will be appended to the correct dataframe:
for (file in file_name) {
data <- aws.s3::s3read_using(read.csv, object = paste0("s3://abc-testtalk/", file))
if (grepl(file, "-results-")) { results <- rbind(results, data)}
if (grepl(file, "-script_results-")) { script_results <- rbind(script_results, data)}
if (grepl(file, "-callers-")) { callers <- rbind(callers, data)}
}
CodePudding user response:
As an alternative to @JohnFranchak's recommendation for map_dfr
(which likely works just fine), the method that I referenced in comments would look something like this:
alldat <- lapply(setNames(nm = dat$file_name),
function(obj) aws.s3::s3read_using(read.csv, object = obj))
callers <- do.call(rbind, alldat[grepl("-callers-", names(alldat))])
results <- do.call(rbind, alldat[grepl("-results-", names(alldat))])
script_results <- do.call(rbind, alldat[grepl("-script_results-", names(alldat))])
others <- do.call(rbind, alldat[!grepl("-(callers|results|script_results)-", names(alldat))])
The do.call(rbind, ...)
part is analogous to dplyr::bind_rows
and data.table::rbindlist
in that it accepts a list of frames, and the result is a single frame. Some differences:
do.call(rbind, ...)
really requires all columns to exist in all frames, in the same order. It's not hard to enforce this externally (e.g., adding missing columns, rearranging), but it's not automatic.data.table::rbindlist
will complain for the same conditions (missing columns or different order), but it hasfill=
anduse.names=
arguments that need to be setTRUE
.dplyr::bind_rows
will fill and row-bind by-name by default, without message or warning. (I don't agree that a default of silence is good all of the time, but it is the simplest.)
Lastly, my use of setNames(nm=..)
is merely to assign the filename to each object. This is not strictly necessary since we still have dat$file_name
, but I've found that with two separate objects, it is feasible to accidentally change (delete, append, or reorder) one of them and not the other, so I prefer to keep the names and the objects (frames) perfectly tied together. These two calls are relatively the same in the resulting named-list:
lapply(setNames(nm = dat$file_name), ...)
sapply(dat$file_name, ..., simplify = FALSE)