I have a simple question that has nonetheless stumped me.
I am trying to extract specific elements from a list of website headers for a set of given URLs. I have obtained the website headers using the httr package. Using magrittr::extract, I am able to successfully extract one element from the header for each URL and include this element in a tibble. However, I am having difficulty figuring out how to extract more than one element from the header for each URL.
For example, the below code helps me successfully extract the "status_code" for each URL and include it within a tibble. There is only one "status_code" for each URL.
pacman::p_load(httr, rvest, dplyr, purrr, tidyr)
some_urls <- c("https://www.psychologytoday.com/us/therapists/new-york/a?page=10",
"https://www.psychologytoday.com/us/therapists/new-york/a?page=4",
"https://www.psychologytoday.com/us/therapists/new-york/a?page=140",
"https://www.psychologytoday.com/us/therapists/new-york/a?page=3"
)
df <- map_dfr(some_urls, ~{
httr::GET(.x) %>%
magrittr::extract(c("url", "status_code"))
})
However, I am not interested in "status_code," but in "status." There may be MORE than one "status" for each URL. I am interested in extracting EVERY "status" for each URL and adding it to a tibble.
The below code does not work, because there is more than one "status" for each URL.
df <- map_dfr(some_urls, ~{
httr::GET(.x) %>%
magrittr::extract(c("url", "status"))
})
This code gives me the following result:
Error:
! Column names `url`, `url`, and `url` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
* "url" at locations 1, 2, 3, and 4.
Backtrace:
1. purrr::map_dfr(...)
2. dplyr::bind_rows(res, .id = .id)
4. tibble:::as_tibble.list(dots)
5. tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
6. tibble:::set_repaired_names(x, repair_hint = TRUE, .name_repair)
8. tibble:::repaired_names(= NULL)
Error:
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
* "url" at locations 1, 2, 3, and 4.
I greatly appreciate any advice you may have! If inserting "name_repair" somewhere into my code is the answer, I have been unable to figure out how to use it successfully. I have also tried setting column names in advance, but that did not work either. Please let me know if you have any advice on how I can extract this information!
CodePudding user response:
We can collapse the multiple status codes into a single string. all_headers is a list whose length can vary from 1 to n. If it has more than one element, loop over all_headers with map, pluck the 'status' from each element, and either paste the values together (toString)
library(purrr)
library(dplyr)
map_dfr(some_urls, ~{
httr::GET(.x) %>%
{tibble(url = .$url,
status = toString(unlist(map(.$all_headers, pluck, "status"))))}
})
-output
# A tibble: 4 × 2
url status
<chr> <chr>
1 https://www.psychologytoday.com/us/therapists/new-york/a?page=10 200
2 https://www.psychologytoday.com/us/therapists/new-york/a?page=4 200
3 https://www.psychologytoday.com/us/therapists/new-york 302, 200
4 https://www.psychologytoday.com/us/therapists/new-york/a?page=3 200
or return a list and then unnest the list column later
library(tidyr)
map_dfr(some_urls, ~{
httr::GET(.x) %>%
{tibble(url = .$url,
status = map(.$all_headers, pluck, "status"))}
}) %>%
unnest(status)
-output
# A tibble: 5 × 2
url status
<chr> <int>
1 https://www.psychologytoday.com/us/therapists/new-york/a?page=10 200
2 https://www.psychologytoday.com/us/therapists/new-york/a?page=4 200
3 https://www.psychologytoday.com/us/therapists/new-york 302
4 https://www.psychologytoday.com/us/therapists/new-york 200
5 https://www.psychologytoday.com/us/therapists/new-york/a?page=3 200
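If you want to try both shapes without hitting the website, here is a minimal offline sketch. It assumes a response can be imitated by a plain list holding only url and all_headers; the mock_responses object and the example.com URLs are made up for illustration and are not real httr output.

library(purrr)
library(tibble)
library(tidyr)

# Hypothetical stand-ins for httr responses: plain lists carrying only the
# two fields used in the answer (url and all_headers).
mock_responses <- list(
  list(url = "https://example.com/a",
       all_headers = list(list(status = 200L))),
  list(url = "https://example.com/b",
       all_headers = list(list(status = 302L), list(status = 200L)))
)

# Shape 1: one row per URL, all statuses collapsed into a single string.
collapsed <- map_dfr(mock_responses, function(resp) {
  tibble(url = resp$url,
         status = toString(map_int(resp$all_headers, "status")))
})

# Shape 2: one row per status. The length-1 url is recycled against the
# list column, which is then unnested into an integer column.
nested <- map_dfr(mock_responses, function(resp) {
  tibble(url = resp$url,
         status = map(resp$all_headers, "status"))
})
long <- unnest(nested, status)

The collapsed version keeps one row per URL, which is handy for a quick overview. The long version is probably easier to filter or count, since each status ends up in its own row.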