I am working with the R programming language.
I am trying to web scrape the second table from a Wikipedia page.
Below, I outline the strategy I used in two different methods (Method 1 and Method 2) that I attempted while trying to scrape this table:
# METHOD 1
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)
final <- data.frame(html %>%
html_element("table.wikitable.sortable") %>%
html_table())
> dim(final)
[1] 33 7
In Method 1, the code seemed to run, but the resulting table is a lot "smaller" (i.e. it has far fewer rows) than the actual table on the Wikipedia page.
I then tried the following code:
# METHOD 2
library(httr)
library(XML)
r <- GET(url)
final <- readHTMLTable(
doc=content(r, "text"))
In Method 2, the table appears to be significantly "bigger" than the previous result (though I am still not sure whether all the rows of the table were included):
111 9,545 9,631 -0.9% 555.96 17.2/km2
[ reached 'max' / getOption("max.print") -- omitted 307 rows ]
But when I try to save the result of Method 2 as a data frame, I get the following error:
final = data.frame(final)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 34, 418, 14, 8, 4
Can someone please show me what I am doing wrong and how I can fix this?
Thanks!
CodePudding user response:
Here is a way. It extracts both tables into a list and, at the end, uses the standard extraction operator [ to get the 2nd table. Note that [ returns a sublist; to extract the table itself, use [[ instead.
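A minimal sketch of that difference, using a toy list (the names and contents here are made up purely for illustration):

```r
# A list of two small data frames, standing in for the scraped tables
tables <- list(first = data.frame(x = 1:3), second = data.frame(y = 4:6))

tables[2]           # `[`  returns a sublist: a list of length 1
class(tables[2])    # "list"

tables[[2]]         # `[[` returns the element itself
class(tables[[2]])  # "data.frame"
```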
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)
html %>%
html_elements(".wikitable") %>%
html_table() -> wikitables
html %>%
html_elements(".wikitable") %>%
html_element("caption") %>%
html_text() %>%
sub("\\n$", "", .) -> wikitables_names
names(wikitables) <- wikitables_names
wikitables[2]
#> $`List of local municipalities in Ontario`
#> # A tibble: 417 × 9
#> `Name[12]` Statu…¹ CSD t…² Censu…³ 2021 …⁴ 2021 …⁵ 2021 …⁶ 2021 …⁷ 2021 …⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Name[12] Status… CSD ty… Census… Popula… Popula… Change Land a… Popula…
#> 2 Addington Hi… Lower-… Townsh… Lennox… 2,534 2,318 9.3% 1,293.… 2.0/km2
#> 3 Adelaide Met… Lower-… Townsh… Middle… 3,011 2,990 0.7% 331.11 9.1/km2
#> 4 Adjala-Tosor… Lower-… Townsh… Simcoe 10,989 10,975 0.1% 371.53 29.6/k…
#> 5 Admaston/Bro… Lower-… Townsh… Renfrew 2,995 2,935 2.0% 519.59 5.8/km2
#> 6 Ajax Lower-… Town Durham 126,666 119,677 5.8% 66.64 1,900.…
#> 7 Alberton Single… Townsh… Rainy … 954 969 −1.5% 116.60 8.2/km2
#> 8 Alfred and P… Lower-… Townsh… Presco… 9,949 9,680 2.8% 391.79 25.4/k…
#> 9 Algonquin Hi… Lower-… Townsh… Halibu… 2,588 2,351 10.1% 999.69 2.6/km2
#> 10 Alnwick/Hald… Lower-… Townsh… Northu… 7,473 6,869 8.8% 398.25 18.8/k…
#> # … with 407 more rows, and abbreviated variable names ¹`Status[12]`,
#> # ²`CSD type[4]`, ³`Census division[32][33][34]`,
#> # ⁴`2021 Census of Population[4]`, ⁵`2021 Census of Population[4]`,
#> # ⁶`2021 Census of Population[4]`, ⁷`2021 Census of Population[4]`,
#> # ⁸`2021 Census of Population[4]`
Created on 2022-09-13 with reprex v2.0.2
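Note that in the printed tibble the first data row repeats the column names (the Wikipedia table has a multi-row header). If you want the 2nd table as a single object with that duplicated header row dropped, a sketch along these lines should work, assuming `wikitables` was built as above and that only the first row is a repeated header, as the output suggests:

```r
# `[[` extracts the table itself rather than a sublist
ontario <- wikitables[[2]]

# Drop the first row, which duplicates the column headers
ontario <- ontario[-1, ]
```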
To coerce the tables to class "data.frame", use the following instead.
html %>%
html_elements(".wikitable") %>%
html_table() %>%
purrr::map(as.data.frame) -> wikitables
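As an aside on why Method 1 came back with only 33 rows: html_element() (singular) returns just the first node matching the selector, which on this page is the first, smaller wikitable. A minimal fix for Method 1 is to switch to html_elements() (plural) and index the second match; this sketch assumes, as above, that the large municipalities table is the second .wikitable on the page:

```r
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)

# html_elements() returns all matching tables; pick the 2nd one with [[
final <- html %>%
  html_elements("table.wikitable") %>%
  .[[2]] %>%
  html_table() %>%
  as.data.frame()
```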