I am working with the R programming language. I am trying to scrape the following website: https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario
I tried the code below:
library(rvest)
url<-"https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario"
page <-read_html(url)
# select every <li> element on the page
b = page %>% html_nodes("li")
This returns a set of nodes, but I am not sure how to turn it into a table of community names.
Ideally, I would like the final results to look something like this:
id names
1 Aberdeen
2 Grey County
3 Aberdeen
4 Prescott and Russell County
5 Aberfeldy
... ...
6 Babys Point
7 Baddow
8 Baden
... ......
Can someone please show me how to do this?
Thanks!
Answer:
You can find the community names as anchor (<a>) elements nested within list items, using either CSS or XPath selectors, and then extract their text with html_text(). Here's a full reprex:
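library(rvest)
result <- "https://en.wikipedia.org/wiki/" %>%
  paste0("List_of_unincorporated_communities_in_Ontario") %>%
  read_html() %>%
  # every link that sits directly inside a list item
  html_elements(xpath = '//ul/li/a') %>%
  html_text() %>%
  # drop the first 29 matches, which come before the first community name;
  # this count is tied to the page layout and may change if the page does
  `[`(-(1:29)) %>%
  as.data.frame() %>%
  setNames('Community')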
head(result, 10)
#> Community
#> 1 10th Line Shore
#> 2 Aberdeen, Grey County
#> 3 Aberdeen, Prescott and Russell County
#> 4 Aberfeldy
#> 5 Aberfoyle
#> 6 Abingdon
#> 7 Abitibi 70
#> 8 Abitibi Canyon
#> 9 Aboyne
#> 10 Acanthus
Created on 2022-09-15 with reprex v2.0.2
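The hard-coded `-(1:29)` depends on how many links happen to precede the list, so it can break silently when the page changes. One possible alternative is to scope the selector to the container that holds the list, then add the id column from the question. The sketch below runs on a small inline HTML string rather than the live page; the `nav` and `content` ids and the three links are made up for illustration, and you would need to inspect the real page to find the container that applies there.

```r
library(rvest)

# Hypothetical stand-in for the page: a navigation list followed by
# the content list we actually want. The ids are assumptions.
html <- '
<html><body>
  <div id="nav"><ul>
    <li><a href="/main">Main page</a></li>
    <li><a href="/help">Help</a></li>
  </ul></div>
  <div id="content"><ul>
    <li><a href="/a">Aberdeen, Grey County</a></li>
    <li><a href="/b">Aberfeldy</a></li>
    <li><a href="/c">Aberfoyle</a></li>
  </ul></div>
</body></html>'

page <- read_html(html)

# Only match links inside list items under the content container,
# so the navigation links never enter the result.
community <- page %>%
  html_elements(xpath = '//div[@id="content"]//ul/li/a') %>%
  html_text(trim = TRUE)

# Build the two-column layout asked for in the question.
result <- data.frame(
  id = seq_along(community),
  names = community,
  stringsAsFactors = FALSE
)

print(result)
```

Scoping by container trades one assumption (a fixed number of leading links) for another (a stable container id or class), but the latter tends to be easier to check by inspecting the page source.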