R: Webscraping a List From Wikipedia

I am working with the R programming language. I am trying to scrape the following website: https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario

I tried the code below:

library(rvest)

url<-"https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario"
page <-read_html(url)

# select every <li> (list item) element on the page
b = page %>% html_nodes("li")

This returns a set of <li> nodes, but I am not sure what to do with them.

Ideally, I would like the final results to look something like this:

  id                       names
  1                    Aberdeen
  2                 Grey County
  3                    Aberdeen
  4 Prescott and Russell County
  5                   Aberfeldy
...                      ...
  6                 Babys Point
  7                      Baddow
  8                       Baden
...                         ...

Can someone please show me how to do this?

Thanks!

CodePudding user response:

You can find the community names as anchor (<a>) elements nested within list (<li>) elements, using either CSS or XPath selectors. Then extract their text with html_text(). Here's a full reprex:

library(rvest)

result <- "https://en.wikipedia.org/wiki/" %>%
  paste0("List_of_unincorporated_communities_in_Ontario") %>%
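  read_html() %>%
  html_elements(xpath = '//ul/li/a') %>%
  html_text() %>%
  # drop the first 29 matches, which come before the community list
  # (the count was valid when this reprex was created)
  `[`(-(1:29)) %>%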
  as.data.frame() %>%
  setNames('Community')

head(result, 10)
#>                                Community
#> 1                        10th Line Shore
#> 2                  Aberdeen, Grey County
#> 3  Aberdeen, Prescott and Russell County
#> 4                              Aberfeldy
#> 5                              Aberfoyle
#> 6                               Abingdon
#> 7                             Abitibi 70
#> 8                         Abitibi Canyon
#> 9                                 Aboyne
#> 10                              Acanthus
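
The output above keeps entries such as "Aberdeen, Grey County" in a single cell, whereas the question asks for each part on its own row with a running id. A minimal sketch of that reshaping step, assuming a comma is the only separator; the `community` vector is a hand-copied sample standing in for `result$Community`, and `parts` and `long` are illustrative names:

```r
# Sample values copied from the output above; in practice this would be
# result$Community from the scraping step.
community <- c("10th Line Shore",
               "Aberdeen, Grey County",
               "Aberdeen, Prescott and Russell County",
               "Aberfeldy")

# Split each entry on a comma followed by optional whitespace,
# then flatten the resulting list into one character vector.
parts <- unlist(strsplit(community, ",\\s*"))

# Build the long-format table with a running id.
long <- data.frame(id = seq_along(parts), names = parts)

print(long)
```

Note that any community whose own name contains a comma would also be split, so it is worth checking the result against the page.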

Created on 2022-09-15 with reprex v2.0.2
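
Dropping a fixed number of links with `-(1:29)` may break if the page gains or loses a navigation link. The sketch below shows the same anchor-inside-list-item selection written as both XPath and CSS, plus a selector scoped to a container so that no positional drop is needed. It runs on a small inline document, not the real page: the markup and the `#content` id are hypothetical. On Wikipedia the article body appears to sit in a container with id `mw-content-text`, but treat that as an assumption to confirm in the browser's inspector.

```r
library(rvest)

# Hypothetical stand-in for the page: a navigation list followed by the
# list of communities. The real page's markup may differ.
html <- minimal_html('
  <ul id="nav"><li><a href="/a">Main page</a></li></ul>
  <div id="content">
    <ul>
      <li><a href="/b">Aberfeldy</a></li>
      <li><a href="/c">Aberfoyle</a></li>
    </ul>
  </div>
')

# XPath and CSS forms of "anchor directly inside a list item of a ul".
via_xpath <- html %>% html_elements(xpath = "//ul/li/a") %>% html_text()
via_css   <- html %>% html_elements("ul > li > a") %>% html_text()

# Scoping the selector to a container skips the navigation link
# without relying on how many links come first.
scoped <- html %>% html_elements("#content ul > li > a") %>% html_text()
```

Here `via_xpath` and `via_css` both include the navigation link, while `scoped` contains only the two community names.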