Webscraping R: no applicable method for 'read_xml' applied to an object of class "lis

Time:11-09

I have this website over here: the realtor.ca map search page (screenshot not reproduced here; the URL is in the code below).

For example, manually inspecting the <div class = "cardcon"> section, I found the following links I needed:

# desired results
- https://www.realtor.ca/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan
- https://www.realtor.ca/real-estate/25050002/39-legacy-lane-hamilton-ancaster
- https://www.realtor.ca/real-estate/25049996/53-16-fourth-st-orangeville-orangeville
- etc.

I noticed that all these desired links are contained within the following type of HTML structure: <a href="*****INSERT LINK HERE****" data-binding="href=DetailsURL" target="_blank">

I had the following question: Using the R programming language, would it be possible to save every link on this page which is contained within this <a href = .... target="_blank"> structure?

For example - I tried this code here:

library(rvest)
library(httr)
library(XML)

url<-"https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"

# making http request
resource <- GET(url)

# converting all the data to HTML format
parse <- htmlParse(resource)

# scraping the href attribute of every <a> tag
links <- xpathSApply(parse, path="//a", xmlGetAttr, "href")

page <-read_html(links)

Error in UseMethod("read_xml") : 
  no applicable method for 'read_xml' applied to an object of class "list"

But I am not sure how to complete this.

Can someone please show me what to do next?

Thank you!

CodePudding user response:
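
On the error itself: xpathSApply() returned a list of href values (the message reports class "list"), while read_html() expects a single document source such as a URL, a file path, or a string of HTML. A minimal sketch of collecting hrefs with rvest alone is shown here. The HTML string is a made-up stand-in that copies the anchor structure quoted in the question; it is not the real page.

```r
library(rvest)

# Hypothetical stand-in for a page, mimicking the <a> structure from the question
html <- '<div class="cardcon">
  <a href="https://www.realtor.ca/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan"
     data-binding="href=DetailsURL" target="_blank">Listing</a>
  <a href="/somewhere-else">Not a listing</a>
</div>'

# read_html() accepts a string of HTML (or a URL / path), not a list of links
page <- read_html(html)

# Select only anchors that open in a new tab, then pull out their href attribute
links <- html_attr(html_elements(page, 'a[target="_blank"]'), "href")
print(links)
```

On the real site the listing cards are presumably filled in by JavaScript after the page loads, so the static HTML returned by GET() is unlikely to contain these anchors at all. That is the reason for going to the API instead.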

Even though the search parameters are encoded in the URL, it is better to call the site's API directly. Open the Network tab of your browser's developer tools and you will find the request the page makes to that API (screenshot not reproduced here).

The parameters that are encoded in your URL can be found in the Payload tab of that request. With httr2 you can retrieve the same information as the site.
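
To see which parameters the URL carries, the part after the # can be split into name/value pairs. Note that a browser or GET() never sends this fragment to the server, so the original request did not include the search settings at all. The sketch below uses base R only; the variable names are illustrative.

```r
# URL copied from the question
url <- "https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"

# Keep only the fragment (everything after the first '#')
fragment <- sub("^[^#]*#", "", url)

# Split into "name=value" pieces, then into name and value
pairs <- strsplit(strsplit(fragment, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)

# Named character vector: names are parameter names, values are decoded strings
params <- setNames(
  vapply(pairs, function(p) utils::URLdecode(p[2]), character(1)),
  vapply(pairs, function(p) p[1], character(1))
)
print(params)
```

The request that follows passes parameters with these same names as form fields, alongside a few extra ones (such as CurrentPage and RecordsPerPage) taken from the payload.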

library(tidyverse)
library(httr2)

content <- "https://api2.realtor.ca/Listing.svc/PropertySearch_Post" %>%
  request() %>%
  req_body_form(
    ZoomLevel = 4,
    LatitudeMax = '67.16743',
    LongitudeMax = '-56.40166',
    LatitudeMin = '-5.70993',
    LongitudeMin = '-139.10674',
    CurrentPage = 2,
    Sort = '6-D',
    PropertyTypeGroupID = 1,
    PropertySearchTypeId = 1,
    TransactionTypeId = 2,
    Currency = 'CAD',
    RecordsPerPage = 12,
    ApplicationId = 1,
    CultureId = 1,
    Version = '7.0'
  ) %>%
  req_headers('referer' = 'https://www.realtor.ca/') %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE)

content %>% 
  getElement('Results') %>%  
  as_tibble

# A tibble: 12 x 21
   Id       MlsNum~1 Publi~2 Build~3 Indiv~4 Prope~5 Busin~6 Land$~7 Posta~8 Relat~9 Statu~*
   <chr>    <chr>    <chr>   <chr>   <list>  <chr>   <df[,0> <chr>   <chr>   <chr>   <chr>  
 1 25049990 W5821020 "Rare ~ 3       <df>    $1,275~         30.2 x~ L9E1J1  /real-~ 1      
 2 25049994 W5821034 "The B~ 3       <df>    $1,499~         29.5 x~ L6M2Z8  /real-~ 1      
 3 25049980 N5821033 "Immac~ 3       <df>    $999,0~         20.01 ~ L4S2K9  /real-~ 1      
 4 25049978 N5821026 "Ravin~ 3       <df>    $990,0~         24.64 ~ L6B0G6  /real-~ 1      
 5 25049977 N5821022 "A Rea~ 2       <df>    $599,9~         NA      L3T4S3  /real-~ 1      
 6 25049976 N5821019 "**** ~ 4       <df>    $1,468~         40.03 ~ L3X2H9  /real-~ 1      
 7 25049973 E5821030 "7 Yea~ 4       <df>    $1,799~         151.71~ L0B1A0  /real-~ 1      
 8 25049971 E5821014 "This ~ 3       <df>    $849,0~         27.1 x~ M4J4C3  /real-~ 1      
 9 25049966 C5821039 "Brigh~ 1       <df>    $568,8~         NA      M2N0L2  /real-~ 1      
10 25049967 C5821042 "Come ~ 1       <df>    $599,0~         NA      M5V0G8  /real-~ 1      
11 25049965 C5821029 "Wow! ~ 1       <df>    $514,9~         NA      M3C1S5  /real-~ 1      
12 25049963 C5821025 "High ~ 2       <df>    $1,199~         NA      M5C0A6  /real-~ 1      
# ... with 26 more variables: Building$Bedrooms <chr>, $StoriesTotal <chr>, $Type <chr>,
#   $Ammenities <chr>, Property$Type <chr>, $Address <df[,5]>, $Photo <list>,
#   $Parking <list>, $ParkingSpaceTotal <chr>, $TypeId <chr>, $OwnershipType <chr>,
#   $ConvertedPrice <chr>, $OwnershipTypeGroupIds <list>, $ParkingType <chr>,
#   $PriceUnformattedValue <chr>, $AmmenitiesNearBy <chr>, PhotoChangeDateUTC <chr>,
#   HasNewImageUpdate <lgl>, Distance <chr>, RelativeURLEn <chr>, RelativeURLFr <chr>,
#   Media <list>, InsertedDateUTC <chr>, TimeOnRealtor <chr>, Tags <list>, ...
# i Use `colnames()` to see all variable names
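
To get the full listing links asked for in the question, the relative paths in the results can be prefixed with the site's host. The sketch below is self-contained and uses a small stand-in data frame built from the example links in the question. The column name RelativeDetailsURL is an assumption (the printed header above is truncated to "Relat~9"), so check colnames() on the real result first.

```r
# Stand-in for the real table; with the actual response you would use:
# results <- getElement(content, "Results")
results <- data.frame(
  Id = c("25050003", "25050002", "25049996"),
  # Assumed column name; confirm it with colnames(results)
  RelativeDetailsURL = c(
    "/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan",
    "/real-estate/25050002/39-legacy-lane-hamilton-ancaster",
    "/real-estate/25049996/53-16-fourth-st-orangeville-orangeville"
  ),
  stringsAsFactors = FALSE
)

# Prefix each relative path with the host to get an absolute link
links <- paste0("https://www.realtor.ca", results$RelativeDetailsURL)
print(links)
```

Changing CurrentPage in the request and repeating this step would collect the links from further pages.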