Checking if URLs "exist" in R


I am trying to check if a large list of URLs "exist" in R. Let me know if you can help!

My objective: I am trying to check whether URLs from the Psychology Today online therapist directory exist. I have a data frame of many possible URLs from this directory. Some of them exist, but some do not. When a URL does not exist, the site redirects to a generic Psychology Today directory page.

For example, this URL exists: "https://www.psychologytoday.com/us/therapists/new-york/a?page=10". This is the tenth page of New York therapists whose last names start with "A." There are at least 10 pages of New York therapists whose names start with "A," so the page exists.

However, this URL does not exist: "https://www.psychologytoday.com/us/therapists/new-york/a?page=119". There are not 119 pages of therapists in New York whose last name starts with "A". Accordingly, the Psychology Today website redirects you to a generic site: "https://www.psychologytoday.com/us/therapists/new-york/a".

My ultimate goal is to get a complete listing of all pages that do exist for New York therapists whose last names start with "A" (and then I will repeat this for other letters, etc.).
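
(For reference, the candidate URLs are just the base letter URL with an increasing page parameter; the upper bound below is only a placeholder I picked to overshoot the real page count:)

candidate_urls <- paste0(
  "https://www.psychologytoday.com/us/therapists/new-york/a?page=", 1:150
)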

Previous post on this topic: There is a prior StackOverflow post on this topic (Check if URL exists in R), and I have implemented the solutions from that post. However, each of those solutions falsely reports that my specific URLs of interest do not exist, even though they do exist!

My code: I have tried the code below to check whether these URLs exist. Both solutions are drawn from the prior post on this topic (linked above). However, both tell me that URLs that do exist on Psychology Today do not exist, and I am not sure why!

Loading packages:

### Load packages and set user agent
pacman::p_load(dplyr, tidyr, stringr, tidyverse, RCurl, pingr)

# Set alternative user agent globally for whole session
options(HTTPUserAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")

# Check user agent string again
options("HTTPUserAgent")

Keep only URLs that are "real": RCurl solution

url.exists("https://www.psychologytoday.com/us/therapists/new-york/a?page=3") 

Result: This solution returns "FALSE", even though this page does exist!

Keep only directory page URLs that are "real": StackOverflow post answer #1 solution

### Function for checking if URLs are "real"
  # From StackOverflow: https://stackoverflow.com/questions/52911812/check-if-url-exists-in-r
#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#'        HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if not `FALSE`, then every time the `non_2xx_return_value` condition
#'        arises a warning message will be displayed. Default is `FALSE`.
#' @param ... other params (`timeout()` would be a good one) passed directly
#'        to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE,...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # You don't need these two helper functions if you're already using `purrr`,
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies, and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet)
          message("Error: ", e$message)

        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  if (is.null(res$result) || 
      ((httr::status_code(res$result) %/% 200) != 1)) {

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if (((httr::status_code(res$result) %/% 200) != 1)) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}

### Create URL list
some_urls <- c("https://www.psychologytoday.com/us/therapists/new-york/a?page=10", # Exists
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=4", # Exists
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=140", # Does not exist
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=3" # Exists
)

### Check if URLs exist
data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% dplyr::tbl_df() %>% print()

Result: This solution returns "FALSE" for every URL, even though 3 out of 4 of them do exist!

Please let me know if you have any advice! I greatly appreciate any advice or suggestions you may have. Thank you!

CodePudding user response:

Both solutions are based on libcurl. The default user agent of httr is built from the versions of libcurl, the curl package (r-curl), and httr. You can see it in verbose mode:

> httr::HEAD(some_urls[1], httr::verbose())
-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: libcurl/7.68.0 r-curl/4.3.2 httr/1.4.3    <<<<<<<<< Here is the problem. I think the site disallows webscraping. You need to check the related robots.txt file(s).
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/2 403 
<- date: Wed, 27 Jul 2022 20:56:28 GMT
<- content-type: text/html; charset=iso-8859-1
<- server: Apache/2.4.53 (Amazon)
<- 
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
  Date: 2022-07-27 20:56
  Status: 403
  Content-Type: text/html; charset=iso-8859-1
<EMPTY BODY>

You can set the user-agent header per function call; I do not know of a global option for this in httr:

> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> httr::HEAD(some_urls[1], user_agent, httr::verbose())

-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/2 200 
<- date: Wed, 27 Jul 2022 21:01:07 GMT
<- content-type: text/html; charset=utf-8
<- server: Apache/2.4.54 (Amazon)
<- x-powered-by: PHP/7.0.33
<- content-language: en-US
<- x-frame-options: SAMEORIGIN
<- expires: Wed, 27 Jul 2022 22:01:07 GMT
<- cache-control: private, max-age=3600
<- last-modified: Wed, 27 Jul 2022 21:01:07 GMT
<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly

NOTE: a number of set-cookie headers omitted here

<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly
<- via: 1.1 ZZ
<- 
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
  Date: 2022-07-27 21:01
  Status: 200
  Content-Type: text/html; charset=utf-8
<EMPTY BODY>

NOTE: I did not investigate RCurl's url.exists(). You would need to make sure it also sends the right user-agent string.
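
(If I recall correctly, extra arguments to url.exists() are passed through as libcurl options, so something along these lines might work; I have not tested it:)

RCurl::url.exists(
  "https://www.psychologytoday.com/us/therapists/new-york/a?page=3",
  useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36"
)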

In a nutshell, without verbose output:

> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> (httr::status_code(httr::HEAD(some_urls[1], user_agent)) %/% 200) == 1
[1] TRUE
> 

I think you can write your own solution from here.
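
For example, a minimal sketch (the page_exists helper is my own name, and it assumes the redirect to the generic letter page is a normal HTTP redirect that httr follows, so the response's final URL resp$url differs from the URL that was requested):

library(httr)

ua <- user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")

page_exists <- function(url) {
  resp <- tryCatch(HEAD(url, ua, timeout(10)), error = function(e) NULL)
  if (is.null(resp)) return(NA)  # connection-level failure (timeout, DNS, ...)
  # Non-existent pages reportedly come back with status 200 but redirected to the
  # generic letter page, so also require the final URL to match the one requested
  status_code(resp) == 200 && identical(resp$url, url)
}

sapply(some_urls, page_exists, USE.NAMES = FALSE)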

CodePudding user response:

Here is a way to read each person's name and their profile statement.

First the main therapists page is read, then the list of US state links is extracted from it, and each state link is read (which may take a while). Finally, a function that extracts the information above is applied to each state page.

library(httr)
library(rvest)
library(dplyr)
library(purrr)

state_info <- function(state) {
  # Therapist names are stored in the "title" attribute of the profile links
  person <- state %>%
    html_elements("a.profile-title") %>%
    html_attr("title") %>%
    sub("See full profile of ", "", .)

  # Pull the text between the outer tags of each statements div
  statement <- state %>%
    html_elements("div.statements") %>%
    stringr::str_extract(">.*<") %>%
    gsub("<|>", "", .)

  data.frame(person, statement)
}

link <- "https://www.psychologytoday.com/us/therapists"
page <- link %>%
  read_html()

# Extract the link to each US state's directory page and download each one
page %>%
  html_elements("div.us_region_list") %>%
  html_elements("a") %>%
  html_attr("href") %>%
  map(read_html) -> states

# Do this for only 3 states,
# remove the `[1:3]` part to get all states info
states[1:3] %>%
  map(state_info) -> therapists_data_list

therapists_data_list[[1]]
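
If you want one combined data frame rather than a list, you could then, for example, run:

dplyr::bind_rows(therapists_data_list)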