Extract elements from list of httr headers

Time:08-10

I have a simple question that has nonetheless stumped me.

I am trying to extract specific elements from a list of website headers for a set of given URLs. I have obtained the website headers using the httr package. Using magrittr::extract, I am able to successfully extract one element from the header for each URL and include this element in a tibble. However, I am having difficulty figuring out how to extract more than one element from the header for each URL.

For example, the below code helps me successfully extract the "status_code" for each URL and include it within a tibble. There is only one "status_code" for each URL.

pacman::p_load(httr, rvest, dplyr, purrr, tidyr)

some_urls <- c("https://www.psychologytoday.com/us/therapists/new-york/a?page=10",
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=4",
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=140",
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=3"
)

df <- map_dfr(some_urls, ~{
    httr::GET(.x) %>% 
    magrittr::extract(c("url", "status_code"))
})

However, I am not interested in "status_code," but in "status." There may be MORE than one "status" for each URL. I am interested in extracting EVERY "status" for each URL and adding it to a tibble.

The below code does not work, because there is more than one "status" for each URL.

df <- map_dfr(some_urls, ~{
  httr::GET(.x) %>% 
  magrittr::extract(c("url", "status"))
})

This code gives me the following result:

Error:
! Column names `url`, `url`, and `url` must not be duplicated.
Use .name_repair to specify repair.
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
  * "url" at locations 1, 2, 3, and 4.
Backtrace:
 1. purrr::map_dfr(...)
 2. dplyr::bind_rows(res, .id = .id)
 4. tibble:::as_tibble.list(dots)
 5. tibble:::lst_to_tibble(x, .rows, .name_repair, col_lengths(x))
 6. tibble:::set_repaired_names(x, repair_hint = TRUE, .name_repair)
 8. tibble:::repaired_names(= NULL)
 Error: 
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
* "url" at locations 1, 2, 3, and 4.

I greatly appreciate any advice you may have! If inserting ".name_repair" somewhere into my code is the answer, I have been unable to figure out how to use it successfully. I have also tried setting column names in advance, but that did not work either. Please let me know if you have any advice on how I can extract this information!

CodePudding user response:

The response object has no top-level "status" element. Each status lives inside all_headers, a list with one element per response in the redirect chain, so its length can vary from 1 to n. Loop over all_headers with map, pluck the "status" from each element, and either paste the values into a single string (toString)

library(purrr)
library(dplyr)

# One row per URL; all statuses in the redirect chain are collapsed into one string
map_dfr(some_urls, ~{
    httr::GET(.x) %>%
        {tibble(url = .$url,
                status = toString(unlist(map(.$all_headers, pluck, "status"))))}
})

-output

# A tibble: 4 × 2
  url                                                              status  
  <chr>                                                            <chr>   
1 https://www.psychologytoday.com/us/therapists/new-york/a?page=10 200     
2 https://www.psychologytoday.com/us/therapists/new-york/a?page=4  200     
3 https://www.psychologytoday.com/us/therapists/new-york           302, 200
4 https://www.psychologytoday.com/us/therapists/new-york/a?page=3  200     
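
To see why this works without hitting the network, here is a minimal sketch that runs the same map/pluck/toString steps on a hand-built list. The object mock_resp and its example.com URL are assumptions made up for illustration; it mimics only the fields used above (url, status_code, all_headers), and a real httr response carries more.

```r
library(purrr)

# Hypothetical stand-in for one httr response that followed a redirect.
# Only the fields used in this thread are mocked.
mock_resp <- list(
  url = "https://example.com/final",
  status_code = 200L,
  all_headers = list(
    list(status = 302L, version = "HTTP/2", headers = list()),
    list(status = 200L, version = "HTTP/2", headers = list())
  )
)

# One status per hop, in the order the hops occurred
statuses <- map_int(mock_resp$all_headers, pluck, "status")

# Collapse to a single string so the result fits in one tibble cell
collapsed <- toString(statuses)
```

Here statuses is an integer vector with one entry per element of all_headers, and collapsed is the comma-separated string that ends up in the status column.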

or return the statuses as a list column and then unnest that column, which gives one row per status

library(tidyr)

# status is a list column here; unnest() expands it to one row per status
map_dfr(some_urls, ~{
    httr::GET(.x) %>%
        {tibble(url = .$url,
                status = map(.$all_headers, pluck, "status"))}
}) %>%
    unnest(status)

-output

# A tibble: 5 × 2
  url                                                              status
  <chr>                                                             <int>
1 https://www.psychologytoday.com/us/therapists/new-york/a?page=10    200
2 https://www.psychologytoday.com/us/therapists/new-york/a?page=4     200
3 https://www.psychologytoday.com/us/therapists/new-york              302
4 https://www.psychologytoday.com/us/therapists/new-york              200
5 https://www.psychologytoday.com/us/therapists/new-york/a?page=3     200
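
Note that the url column in both outputs holds the final URL after any redirect, which is why the page=140 request shows up as .../new-york. If the originally requested URL is also needed, one possible variation is sketched below. The helper status_rows, the mocked responses, and the example.com URLs are all assumptions for illustration, not part of httr.

```r
library(purrr)
library(tibble)

# Hypothetical helper: one row per hop, keeping both the URL that was
# requested and the URL the response finally resolved to.
status_rows <- function(requested, resp) {
  tibble(
    requested_url = requested,
    final_url = resp$url,
    status = map_int(resp$all_headers, pluck, "status")
  )
}

# Mocked responses standing in for httr::GET() results
mock_urls <- c("https://example.com/a?page=4",
               "https://example.com/a?page=140")
mock_resps <- list(
  list(url = "https://example.com/a?page=4",
       all_headers = list(list(status = 200L))),
  list(url = "https://example.com/",
       all_headers = list(list(status = 302L), list(status = 200L)))
)

# Row-bind the per-URL tibbles (map2_dfr relies on dplyr being installed)
result <- map2_dfr(mock_urls, mock_resps, status_rows)
```

With real requests, the same helper could presumably be called as map_dfr(some_urls, ~ status_rows(.x, httr::GET(.x))).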