R: Webscraping a List From Wikipedia

Time:09-16

I am working with the R programming language. I am trying to scrape the following website: https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario

I tried the code below:

library(rvest)

url<-"https://en.wikipedia.org/wiki/List_of_unincorporated_communities_in_Ontario"
page <-read_html(url)

# select every <li> element on the page
b <- page %>% html_nodes("li")

This seems to have produced some result, but I am not sure what to do with it.

Ideally, I would like the final results to look something like this:

  id                       names
  1                    Aberdeen
  2                 Grey County
  3                    Aberdeen
  4 Prescott and Russell County
  5                   Aberfeldy
 ...                         ...
  6                 Babys Point
  7                      Baddow
  8                       Baden
 ...                         ...

Can someone please show me how to do this?

Thanks!

CodePudding user response:

The names you want are anchor (<a>) elements nested inside list items, so you can select them with a CSS or XPath selector and then extract their text with html_text(). Here's a full reprex:

library(rvest)

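# Build the URL, parse the page, and take the text of every link inside a list item
result <- "https://en.wikipedia.org/wiki/" %>%
  paste0("List_of_unincorporated_communities_in_Ontario") %>%
  read_html() %>%
  html_elements(xpath = '//ul/li/a') %>%
  html_text() %>%
  # drop the first 29 links, which precede the community list
  # (this count depends on the page layout at the time of scraping)
  `[`(-(1:29)) %>%
  as.data.frame() %>%
  setNames('Community')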

head(result, 10)
#>                                Community
#> 1                        10th Line Shore
#> 2                  Aberdeen, Grey County
#> 3  Aberdeen, Prescott and Russell County
#> 4                              Aberfeldy
#> 5                              Aberfoyle
#> 6                               Abingdon
#> 7                             Abitibi 70
#> 8                         Abitibi Canyon
#> 9                                 Aboyne
#> 10                              Acanthus

Created on 2022-09-15 with reprex v2.0.2
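
To get closer to the layout in the question (an id column, with the community and its county on separate rows), one option is to split each name on the comma and number the pieces. This is only a sketch in base R: the `community` vector is a hand-copied sample of the output above rather than a live scrape, and the names `community`, `parts` and `wanted` are my own. It also assumes every comma separates a community from its county, which may not hold for every entry on the page.

```r
# Hand-copied sample of the scraped names shown above (assumption: stands in
# for result$Community so the snippet runs without a network connection)
community <- c(
  "10th Line Shore",
  "Aberdeen, Grey County",
  "Aberdeen, Prescott and Russell County",
  "Aberfeldy"
)

# Split each entry on ", " and flatten the pieces into one character vector
parts <- unlist(strsplit(community, ", ", fixed = TRUE))

# Number the pieces to get the id / names layout from the question
wanted <- data.frame(id = seq_along(parts), names = parts)

print(wanted)
```

With the real data, replacing the sample vector with `as.character(result$Community)` should give the same shape for the whole list.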
