I've been trying to learn web scraping from an online course, which gives the following example:
url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
website <- read_html(url)
treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
treaties_links <- treaties_links[23:30]
treaties_links_full <- lapply(treaties_links, function(x) (paste("https://www.canada.ca",x,sep="")))
treaties_links_full[8] <- treaties_links[8]
treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))
When I get to the last line, it returns an error:
Error in open.connection(x, "rb") : Could not resolve host: www.canada.cahttp
Answer:
The error comes from your lapply() call. If you print treaties_links, you will see that not all of the links are internal (i.e. starting with /); some point to other domains:
print(treaties_links)
[1] "/en/employment-social-development/services/labour-relations/international/agreements/chile.html"
[2] "/en/employment-social-development/services/labour-relations/international/agreements/costa-rica.html"
[3] "/en/employment-social-development/services/labour-relations/international/agreements/peru.html"
[4] "/en/employment-social-development/services/labour-relations/international/agreements/colombia.html"
[5] "/en/employment-social-development/services/labour-relations/international/agreements/jordan.html"
[6] "/en/employment-social-development/services/labour-relations/international/agreements/panama.html"
[7] "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
[8] "http://international.gc.ca/trade-commerce/assets/pdfs/agreements-accords/korea-coree/18_CKFTA_EN.pdf"
This means that when you run paste("https://www.canada.ca",x,sep="") on e.g. link 7, you get:
"https://www.canada.cahttp://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
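The mismatch can be spotted before any request is made. The following is a minimal sketch in base R, using two of the links printed above; the names sample_links, is_absolute and pasted are illustrative and not part of the original code.

```r
# Two of the links from the printed output: one relative, one absolute.
sample_links <- c(
  "/en/employment-social-development/services/labour-relations/international/agreements/chile.html",
  "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
)

# TRUE where a link already carries its own scheme and host.
is_absolute <- grepl("^https?://", sample_links)

# Prepending the domain to every link, as in the question, mangles the absolute one.
pasted <- paste("https://www.canada.ca", sample_links, sep = "")

print(is_absolute)
print(pasted[is_absolute])
```

Any link flagged TRUE here should be left untouched rather than prefixed.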
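The malformed URL for link 7 is where the host www.canada.cahttp in the error message comes from; your code restores link 8 by hand, but not link 7. Assuming you want to keep that link, you might change your lapply to: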
treaties_links_full <- lapply(
  treaties_links,
  function(x) {
    ifelse(
      substr(x,1,1)=="/",
      paste("https://www.canada.ca",x,sep=""),
      x
    )
  }
)
This prepends "https://www.canada.ca" only to links that start with /, i.e. links within that domain, and leaves the others unchanged.
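Since ifelse(), startsWith() and paste0() are all vectorised, the same idea can also be written without lapply. This is only a sketch under the assumption that every internal link starts with /; base_url, sample_links, needs_prefix and full_links are illustrative names, and the two links are taken from the output above.

```r
base_url <- "https://www.canada.ca"

# Stand-in for treaties_links: one internal link and one external link.
sample_links <- c(
  "/en/employment-social-development/services/labour-relations/international/agreements/chile.html",
  "http://www.international.gc.ca/trade-agreements-accords-commerciaux/agr-acc/honduras/labour-travail.aspx?lang=eng"
)

# Only links that begin with "/" belong to base_url and need the prefix.
needs_prefix <- startsWith(sample_links, "/")

# Prefix the internal links, keep the external ones as they are.
full_links <- ifelse(needs_prefix, paste0(base_url, sample_links), sample_links)

print(full_links)
```

The result is a character vector rather than a list, but it can be passed to lapply(full_links, read_html) in the same way. If the xml2 package is available, its url_absolute() function may be a more general option, since it resolves relative links against a base URL.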