I am working with the R programming language.
I am trying to web scrape the second table from a Wikipedia page.
Below, I outline the strategy I used in two different methods (Method 1 and Method 2) that I attempted while trying to scrape this table:
# METHOD 1
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)
final <- data.frame(html %>%
html_element("table.wikitable.sortable") %>%
html_table())
> dim(final)
[1] 33 7
In Method 1, the code seemed to run, but the resulting table is a lot "smaller" (i.e. it has far fewer rows) than the actual table on the Wikipedia page.
I then tried the following code:
# METHOD 2
library(httr)
library(XML)
r <- GET(url)
final <- readHTMLTable(
doc=content(r, "text"))
In Method 2, the table appears to be significantly "bigger" than the previous result (though I am still not sure whether all the rows of the table were included):
111 9,545 9,631 -0.9% 555.96 17.2/km2
[ reached 'max' / getOption("max.print") -- omitted 307 rows ]
But when I try to save the result of Method 2 as a data frame, I get the following error:
final = data.frame(final)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 34, 418, 14, 8, 4
Can someone please show me what I am doing wrong and how I can fix this?
Thanks!
CodePudding user response:
Here is a way. It extracts both tables into a list and, at the end, uses the standard extraction operator [ to get the 2nd table. Note that [ returns a sublist; to extract the table itself, use [[ instead.
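A minimal sketch of that difference, using a toy list (the names and contents here are made up purely for illustration):

```r
# A list of two small data frames, standing in for the scraped tables
tables <- list(first = data.frame(x = 1:3), second = data.frame(y = 4:6))

tables[2]           # `[`  returns a sublist: a list of length 1
class(tables[2])    # "list"

tables[[2]]         # `[[` returns the element itself
class(tables[[2]])  # "data.frame"
```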
suppressPackageStartupMessages({
library(rvest)
library(dplyr)
})
url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)
html %>%
html_elements(".wikitable") %>%
html_table() -> wikitables
html %>%
html_elements(".wikitable") %>%
html_element("caption") %>%
html_text() %>%
sub("\\n$", "", .) -> wikitables_names
names(wikitables) <- wikitables_names
wikitables[2]
#> $`List of local municipalities in Ontario`
#> # A tibble: 417 × 9
#> `Name[12]` Statu…¹ CSD t…² Censu…³ 2021 …⁴ 2021 …⁵ 2021 …⁶ 2021 …⁷ 2021 …⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Name[12] Status… CSD ty… Census… Popula… Popula… Change Land a… Popula…
#> 2 Addington Hi… Lower-… Townsh… Lennox… 2,534 2,318 9.3% 1,293.… 2.0/km2
#> 3 Adelaide Met… Lower-… Townsh… Middle… 3,011 2,990 0.7% 331.11 9.1/km2
#> 4 Adjala-Tosor… Lower-… Townsh… Simcoe 10,989 10,975 0.1% 371.53 29.6/k…
#> 5 Admaston/Bro… Lower-… Townsh… Renfrew 2,995 2,935 2.0% 519.59 5.8/km2
#> 6 Ajax Lower-… Town Durham 126,666 119,677 5.8% 66.64 1,900.…
#> 7 Alberton Single… Townsh… Rainy … 954 969 −1.5% 116.60 8.2/km2
#> 8 Alfred and P… Lower-… Townsh… Presco… 9,949 9,680 2.8% 391.79 25.4/k…
#> 9 Algonquin Hi… Lower-… Townsh… Halibu… 2,588 2,351 10.1% 999.69 2.6/km2
#> 10 Alnwick/Hald… Lower-… Townsh… Northu… 7,473 6,869 8.8% 398.25 18.8/k…
#> # … with 407 more rows, and abbreviated variable names ¹`Status[12]`,
#> # ²`CSD type[4]`, ³`Census division[32][33][34]`,
#> # ⁴`2021 Census of Population[4]`, ⁵`2021 Census of Population[4]`,
#> # ⁶`2021 Census of Population[4]`, ⁷`2021 Census of Population[4]`,
#> # ⁸`2021 Census of Population[4]`
Created on 2022-09-13 with reprex v2.0.2
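Note that in the printed tibble the first data row repeats the column names (the Wikipedia table has a multi-row header). If you want the 2nd table as a single object with that duplicated header row dropped, a sketch along these lines should work, assuming `wikitables` was built as above and that only the first row is a repeated header, as the output suggests:

```r
# `[[` extracts the table itself rather than a sublist
ontario <- wikitables[[2]]

# Drop the first row, which duplicates the column headers
ontario <- ontario[-1, ]
```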
To coerce the tables to class "data.frame", use the following instead.
html %>%
html_elements(".wikitable") %>%
html_table() %>%
purrr::map(as.data.frame) -> wikitables
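As an aside on why Method 1 came back with only 33 rows: html_element() (singular) returns just the first node matching the selector, which on this page is the first, smaller wikitable. A minimal fix for Method 1 is to switch to html_elements() (plural) and index the second match; this sketch assumes, as above, that the large municipalities table is the second .wikitable on the page:

```r
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_municipalities_in_Ontario"
html <- read_html(url)

# html_elements() returns all matching tables; pick the 2nd one with [[
final <- html %>%
  html_elements("table.wikitable") %>%
  .[[2]] %>%
  html_table() %>%
  as.data.frame()
```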