No connection in R

I've been trying to learn web scraping from an online course, and they give the following as an example:

url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
website <- read_html(url)
treaties_links <- website %>%
  html_nodes("li") %>%
  html_nodes("a") %>%
  html_attr("href")
treaties_links <- treaties_links[23:30]
treaties_links_full <- lapply(treaties_links, function(x) (paste("https://www.canada.ca", x, sep = "")))
treaties_links_full[8] <- treaties_links[8]
treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))

When I run the last line, it returns an error:

Error in open.connection(x, "rb") : Could not resolve host: www.canada.cahttp

CodePudding user response：

The error comes from the lapply() call that builds treaties_links_full. If you print treaties_links, you will see that not all of them are internal links (links starting with /); some are absolute URLs pointing to other domains:

print(treaties_links)
[1] "/en/employment-social-development/services/labour-relations/international/agreements/chile.html"
[2] "/en/employment-social-development/services/labour-relations/international/agreements/costa-rica.html"
[3] "/en/employment-social-development/services/labour-relations/international/agreements/peru.html"
[4] "/en/employment-social-development/services/labour-relations/international/agreements/colombia.html"
[5] "/en/employment-social-development/services/labour-relations/international/agreements/jordan.html"
[6] "/en/employment-social-development/services/labour-relations/international/agreements/panama.html"
[7] "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
[8] "http://international.gc.ca/trade-commerce/assets/pdfs/agreements-accords/korea-coree/18_CKFTA_EN.pdf"

This means that when you are running paste("https://www.canada.ca",x,sep="") on e.g. link 7, you get:

"https://www.canada.cahttp://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
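To see where the host name in the error message comes from, here is a small sketch that needs no network access. The regular expression is only a rough approximation of how a URL parser picks out the host, and the variable names (base_url, external, bad, host) are made up for illustration:

```r
# Rebuild the broken URL exactly as the original lapply() would for link 7.
base_url <- "https://www.canada.ca"
external <- "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
bad <- paste(base_url, external, sep = "")

# Rough approximation (an assumption, not the real parser): the host is
# whatever sits between "://" and the next "/" or ":".
host <- sub("^https?://([^/:]+).*$", "\\1", bad)
cat(host, "\n")
```

The extracted host is www.canada.cahttp, which is the name the error says it could not resolve.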

Assuming you want to keep that link you might change your lapply to:

treaties_links_full <- lapply(
    treaties_links,
    function(x) {
        # Prepend the domain only to relative links (those starting with "/")
        ifelse(
            substr(x, 1, 1) == "/",
            paste("https://www.canada.ca", x, sep = ""),
            x
        )
    }
)

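This prepends "https://www.canada.ca" only to relative links (those starting with /) and leaves absolute URLs, such as links 7 and 8, unchanged. The separate treaties_links_full[8] <- treaties_links[8] assignment then becomes unnecessary.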
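The same idea can also be written in a vectorized way using only base R. This is a minimal sketch: make_absolute is a hypothetical helper name, the three sample links are copied from the printed output above, and it assumes every relative link starts with a single / (protocol-relative links starting with // would need extra handling):

```r
# Hypothetical helper: prepend the base URL to relative links only.
make_absolute <- function(links, base = "https://www.canada.ca") {
  is_relative <- startsWith(links, "/")   # TRUE for links such as "/en/..."
  links[is_relative] <- paste0(base, links[is_relative])
  links
}

# Sample of the links printed above (entries 1, 7 and 8).
sample_links <- c(
  "/en/employment-social-development/services/labour-relations/international/agreements/chile.html",
  "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng",
  "http://international.gc.ca/trade-commerce/assets/pdfs/agreements-accords/korea-coree/18_CKFTA_EN.pdf"
)

fixed_links <- make_absolute(sample_links)
print(fixed_links)
```

Only the first entry gets the prefix; the two absolute URLs come back unchanged. The result is a character vector rather than a list, but it can be passed to lapply() with read_html in the same way as treaties_links_full.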