R: Errors when webscraping across multiple tables with same URL


I'm fairly new to web scraping and am having trouble troubleshooting my code. At the moment I'm getting different errors every time and don't really know where to continue. I'm currently looking into using RSelenium, but would greatly appreciate some advice and feedback on the code below.

I based my initial code on the following question: R: How to web scrape a table across multiple pages with the same URL

library(xml2)
library(RCurl)
library(dplyr)
library(rvest)

i=1
table = list()
for (i in 1:15) {
  data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
  page <- read_html(data)
  table1 <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header=T)
  i=i+1
  table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
  table=bind_rows(table, table1)
  print(i)}

table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")

Below are the errors I am receiving at the moment. I know it's a lot, but I assume some of them are a result of previous errors. Any help would be greatly appreciated!

i=1

table = list()
for (i in 1:15) {
  data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))

Error: unexpected ',' in: "for (i in 1:15) { data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","

page <- read_html(data)

Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "function"

table1 <- page %>%
  html_nodes(xpath = "(//table)[2]") %>%
  html_table(header=T)

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function"

i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))

Error in is.factor(x) : object 'table1' not found

table=bind_rows(table, table1)

Error in list2(...) : object 'table1' not found

print(i)}

Error: unexpected '}' in "  print(i)}"

table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")

CodePudding user response:

Hi Einar! This looks like an exciting task.

I would strongly advise you to use RSelenium to scrape this site, as it is a dynamic one. I have provided the code to do so below. Firefox is my preferred browser for the job, but Chrome also works. You can execute the whole code at once.

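As an aside, the immediate parse error in your original loop comes from the missing paste0() call when building the URL. A minimal corrected sketch of that static approach is below, but because this table is rendered client-side by JavaScript, the XPath will most likely match nothing, which is exactly why RSelenium is needed here:

library(rvest)
library(dplyr)

tables <- list()
for (i in 1:15) {
  # paste0() concatenates the base URL and the page parameter
  data <- paste0("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/",
                 "?page=", i)
  page <- read_html(data)
  # No manual i=i+1 is needed; the for loop advances i itself.
  # On this site the node set is likely empty, because the table is
  # filled in by JavaScript after the initial page load.
  tables[[i]] <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header = TRUE)
}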
As you can see in the embedded picture, the output has 1402 observations and is complete. You will need to clean the table afterwards, though, as the header text is included in the observations.

require(RSelenium)
require(rvest) 
require(tidyverse)
require(netstat)

url <- "https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/"

rD <- rsDriver(browser = "firefox", port = free_port())
remDr <- rD[["client"]]

remDr$navigate(url) # Go to the website

Sys.sleep(1)
df <- remDr$getPageSource()[[1]] %>% # Get source data from page 1
  read_html() %>% 
  html_table() %>%
  .[[1]] # Grab the table

Sys.sleep(1)

indexes <- c(1:6, rep(6, 8)) # Weird indexing on the website
# You can't simply loop over pages 1:15: once you navigate to page 7,
# the website indexes pages 8 through 15 as 6

get_table <- function(page_i) {
  page <- remDr$findElement(
    using = "xpath",
    paste0(
      '//*[@id="page"]/div[2]/div[2]/div/div/div[2]/div/div[3]/span/a[',
      page_i, # Indexes from page 2 to 15. 
      ']'
    )
  )
  page$clickElement()
  Sys.sleep(1)
  page_df <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_table() %>%
    .[[1]]
  
  return(page_df)
}

df_2 <- map_df(indexes, get_table) # Map over the indexes (prefer map over an explicit loop)
final_df <- rbind(df, df_2) # Row bind the first page with the rest

remDr$close() # Shutdown RSelenium
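Note that rsDriver() also starts a local Selenium server alongside the browser client, and remDr$close() only ends the browser session. To shut everything down cleanly, stop the server object as well:

rD$server$stop() # Stop the Selenium server process itself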

EDIT: This is how I would clean the data afterwards.

final_df %>% 
  janitor::clean_names() %>%
  mutate(eiendomstype = eiendomstype %>% str_remove_all("Eiendomstype"),
         kommune = kommune %>% str_remove_all("Kommune"), 
         fylke = fylke %>% str_remove_all("Fylke"),
         takst = takst %>% 
           str_remove_all(" ") %>% parse_number(),
         salgssum = salgssum %>% 
           str_remove_all(" ") %>% parse_number(),
         ar = ar %>% parse_number())

[Picture of the output]

CodePudding user response:

The following code produces a dataframe containing all the data you are seeking. Rather than using RSelenium, the code below fetches the data directly from the same API the site uses to populate the table, so you do not need to combine multiple pages:

library(tidyverse)
library(rvest)
library(jsonlite)

####GET NUMBER OF ITEMS#####

# The trailing /0/10/ appears to be the start index and the number of items to return
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"

data <- jsonlite::fromJSON(url, flatten = TRUE)

totalItems <- data$TotalNumberOfItems

####GET ALL OF THE ITEMS#####

allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
  jsonlite::fromJSON(., flatten = TRUE) %>%
  .[1] %>%
  as.data.frame() %>%
  rename_with(~str_replace(., "ListItems.", ""), everything())
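Since the column names returned by the API are not shown here, a cautious follow-up step is to normalise the names and inspect the result before parsing any numbers. The sketch below mirrors the cleaning approach from the first answer and assumes only that allData contains character columns; adjust once you have seen the actual names:

library(janitor)

allData %>%
  clean_names() %>%                                   # snake_case column names
  mutate(across(where(is.character), str_squish)) %>% # trim stray whitespace
  glimpse()                                           # inspect names and types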