EDIT: To make things as reproducible as possible, I made a dummy bucket on AWS that I will give the credentials for.
What I'm trying to accomplish: create a new column called state_name in each dataframe that comes from a CSV file in the S3 bucket, where the column contains the name of the state that appears in the .csv file name.
Here is the access:
Sys.setenv(
"AWS_ACCESS_KEY_ID" = "AKIAX2TY5WNMXPACEXVV",
"AWS_SECRET_ACCESS_KEY" = "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",
"AWS_DEFAULT_REGION" = "us-east-1"
)
The first part of the script creates a dataframe based on the file information coming from the S3 bucket. There's also some cleaning that isn't particularly relevant to this question, but I kept it in to maintain script integrity.
#Retrieves all the items in an S3 bucket.
all_buckets <- aws.s3::get_bucket_df(bucket = "so-bucket-test-1222", max = Inf) %>% arrange(desc(LastModified))
#Filtering out some unnecessary files in the bucket (not super relevant to this question, but keeping it in to maintain script integrity)
buckets_format <- all_buckets %>% filter(str_detect(Key, '-Algorithm='))
buckets_script_results <- all_buckets %>% filter(str_detect(Key, '-script_results-')) %>% filter(!str_detect(Key, '-Algorithm=')) %>% filter(!str_detect(Key, '2020'))
The next part of the script creates a character vector with the names of all the CSV files that are going to be extracted from the S3 bucket, opened, and put into a dataframe.
#Extracts the key needed to retrieve and open the CSV files in the bucket
script_results <- buckets_script_results$Key
Here is the part where I actually extract the data from the CSVs living in the S3 bucket, and put it all into one dataframe:
#This is the actual extraction process where the files are downloaded and put into a dataframe.
build_list <- lapply(script_results, function(fn) aws.s3::s3read_using(read.csv, bucket="so-bucket-test-1222", object=fn))
build_df <- do.call(rbind.fill, archives)
So what I am trying to do is build a step into this function that takes the state name extracted from the file name and uses it to create a new column in each downloaded file containing that state name.
For example: the pennsylvania file will have a new column called state_name where all four rows say pennsylvania. Same thing for all 939 rows in the wisconsin file, and so on...
So the final dataframe should look something like this:
final_df <- read.csv('https://raw.githubusercontent.com/datacfb123/testdata/main/final_df.csv')
EDIT: Here is what I tried based on Cazman's answer below:
Sys.setenv(
"AWS_ACCESS_KEY_ID" = "AKIAX2TY5WNMXPACEXVV",
"AWS_SECRET_ACCESS_KEY" = "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",
"AWS_DEFAULT_REGION" = "us-east-1"
)
all_buckets <- aws.s3::get_bucket_df(bucket = "so-bucket-test-1222", max = Inf) %>% arrange(desc(LastModified))
buckets_script_results <- all_buckets %>% filter(str_detect(Key, '-script_results-')) %>% filter(!str_detect(Key, '-Algorithm=')) %>% filter(!str_detect(Key, '2020'))
script_results <- buckets_script_results$Key
#state_name comes from the answer below: the built-in state.name vector with spaces replaced by hyphens to match the file names
state_name <- str_replace(state.name, "\\s", "-")
build_func <- function(path) {
  state = str_extract(path, paste(tolower(state_name), collapse = "|"))
  temp = aws.s3::s3read_using(read.csv, bucket="so-bucket-test-1222", object=path)
  temp['state'] = state
  return(temp)
}
build_list <- lapply(script_results, build_func)
CodePudding user response:
If I understand correctly, you want to do what @GregorThomas said with regard to finding the state name in the path, but to do it inside the function that lapply() uses to build build_list. I couldn't connect to the AWS bucket, so I modified everything to read from the GitHub repo, but if I understand the s3read_using() function correctly, it will return a data frame from read.csv() in this case.
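(Just to illustrate that assumption: the object key below is made up purely for the example, and both calls should hand back an ordinary data frame.)
#Hypothetical key, for illustration only: fetch the object from S3 and parse it with read.csv()
from_s3  <- aws.s3::s3read_using(read.csv, bucket = "so-bucket-test-1222", object = "example-key.csv")
#Reading the GitHub copy of a file over plain HTTP gives the same kind of result
from_url <- read.csv("https://raw.githubusercontent.com/datacfb123/testdata/main/final_df.csv")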
s = lapply(script_results, function(x) paste('https://raw.githubusercontent.com/datacfb123/testdata/main/', x, sep = ""))
s = unlist(s)
state_name = str_replace(state.name, "\\s", "-")
build_func <- function(path) {
  state = str_extract(path, paste(tolower(state_name), collapse = "|"))
  temp = read.csv(path) %>%
    select(c(1:7))
  temp['state'] = state
  return(temp)
}
build_list <- lapply(s, build_func)
build_df <- bind_rows(build_list, .id = 'column_label')
The head of build_df:
column_label ID ID.Type Voter.First.Name Voter.Last.Name Voter.Phone
1 1 9321329 B firstname1 lastname1 1234567890
2 1 8958993 B firstname2 lastname2 1234567891
3 1 119964 B firstname3 lastname3 1234567892
4 1 12713977 B firstname4 lastname4 1234567893
5 1 13997387 B firstname5 lastname5 1234567894
6 1 9813986 B firstname6 lastname6 1234567895
Date.Called Time.Called..EST. state Voter.ID
1 10/16/20 06:30 PM EST wisconsin NA
2 10/16/20 06:31 PM EST wisconsin NA
3 10/16/20 06:31 PM EST wisconsin NA
4 10/16/20 06:31 PM EST wisconsin NA
5 10/16/20 06:31 PM EST wisconsin NA
6 10/16/20 06:31 PM EST wisconsin NA
and the tail:
column_label ID ID.Type Voter.First.Name Voter.Last.Name Voter.Phone
1469 4 NA A firstname422 lastname422 1234568311
1470 4 NA A firstname423 lastname423 1234568312
1471 4 NA A firstname424 lastname424 1234568313
1472 4 NA A firstname425 lastname425 1234568314
1473 4 NA A firstname426 lastname426 1234568315
1474 4 NA A firstname427 lastname427 1234568316
Date.Called Time.Called..EST. state Voter.ID
1469 9/28/21 07:57 PM EST north-carolina 10248248
1470 9/28/21 07:56 PM EST north-carolina 6290407
1471 9/28/21 07:58 PM EST north-carolina 18397149
1472 9/28/21 07:58 PM EST north-carolina 19520720
1473 9/28/21 07:59 PM EST north-carolina 17216409
1474 9/28/21 08:00 PM EST north-carolina 19273804
The only thing I really did was modify state.name to replace spaces with - (since your paths looked to be formatted that way) and put what @GregorThomas wrote into a function.
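For example, the substitution turns a two-word state name into the hyphenated form used in the file names (the literal string here is just for illustration):
str_replace("North Carolina", "\\s", "-")
# [1] "North-Carolina"
tolower(str_replace("North Carolina", "\\s", "-"))
# [1] "north-carolina"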
With the credentials, build_func() turns into:
build_func <- function(path) {
  state = str_extract(path, paste(tolower(state_name), collapse = "|"))
  temp = aws.s3::s3read_using(read.csv, bucket="so-bucket-test-1222", object=path)
  temp['state'] = state
  return(temp)
}
and the call is:
build_list <- lapply(script_results, build_func)
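with build_df assembled from the list the same way as before:
build_df <- bind_rows(build_list, .id = 'column_label')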
Then head(build_df[, 1:11]) gives:
column_label ID ID.Type Voter.First.Name Voter.Last.Name Voter.Phone
1 1 7591131 A firstname1 lastname1 1234567890
2 1 7930949 B firstname2 lastname2 1234567890
3 1 7371170 B firstname3 lastname3 1234567890
4 1 7725825 A firstname4 lastname4 1234567890
5 2 11808809 A firstname1 lastname1 1234567890
6 2 2236195 A firstname2 lastname2 1234567891
Date.Called Time.Called..EST. starting_question state call_ask
1 10/19/21 04:35 PM EST Wrong Number pennsylvania <NA>
2 10/19/21 04:28 PM EST Not Home (1) pennsylvania <NA>
3 10/19/21 04:33 PM EST Refused (2) pennsylvania <NA>
4 10/19/21 04:34 PM EST Hangup pennsylvania <NA>
5 11/29/21 03:16 PM EST Talking to Correct Person arizona Yes
6 11/29/21 03:14 PM EST Wrong Number (2) arizona
There are quite a few columns that seem to have a lot of NA values, but I assume that is just the data? Regardless, unique(build_df$state) returns:
[1] "pennsylvania" "arizona" "north-carolina"
with 535 total rows.
CodePudding user response:
We can use the built-in vector state.name to extract the state names using regex.
x = c("10162021-test-wisconsin-script_results-10-16-2020-568019545.csv",
"10192021-test-pennsylvania-script_results-10-19-2021-935649800.csv",
"11292021-test-arizona-script_results-11-29-2021-716199808.csv")
library(stringr)
(states = str_extract(x, paste(tolower(state.name), collapse = "|")))
# [1] "wisconsin" "pennsylvania" "arizona"
Then, if you set the list names to be the extracted state names, dplyr::bind_rows() can add an ID column:
names(build_list) = states
build_df = dplyr::bind_rows(build_list, .id = "state")
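As a minimal, self-contained sketch of what .id does here (toy data, not the actual files):
library(dplyr)
toy <- list(
  wisconsin    = data.frame(ID = 1:2),
  pennsylvania = data.frame(ID = 3)
)
bind_rows(toy, .id = "state")
#          state ID
# 1    wisconsin  1
# 2    wisconsin  2
# 3 pennsylvania  3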
In the context of your script:
all_buckets <- aws.s3::get_bucket_df(bucket = "test-bucket", max = Inf) %>% arrange(desc(LastModified))
#Filtering out some unnecessary files in the bucket (not super relevant to this question, but keeping it in to maintain script integrity)
buckets_format <- all_buckets %>% filter(str_detect(Key, '-Algorithm='))
buckets_script_results <- all_buckets %>% filter(str_detect(Key, '-script_results-')) %>% filter(!str_detect(Key, '-Algorithm=')) %>% filter(!str_detect(Key, '2020'))
#Extracts the key needed to retrieve and open the CSV files in the bucket
script_results <- buckets_script_results$Key
#This is the actual extraction process where the files are downloaded and put into a dataframe.
build_list <- lapply(script_results, function(fn) aws.s3::s3read_using(read.csv, bucket="test-bucket", object=fn))
### NEW LINES
#(If any keys use hyphenated state names like north-carolina, substitute the hyphenated state_name vector from the other answer.)
states = stringr::str_extract(script_results, paste(tolower(state.name), collapse = "|"))
names(build_list) = states
build_df <- dplyr::bind_rows(build_list, .id = "state")
## (assuming `archives` was a typo for `build_list`
## since `build_list` doesn't come up again
## and `archives` came out of nowhere)