I have this website over here:
For example, by manually inspecting the <div class = "cardcon"> section, I found the following links I needed:
# desired results
- https://www.realtor.ca/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan
- https://www.realtor.ca/real-estate/25050002/39-legacy-lane-hamilton-ancaster
- https://www.realtor.ca/real-estate/25049996/53-16-fourth-st-orangeville-orangeville
- etc.
I noticed that all of these links sit inside the same kind of HTML element: <a href="*****INSERT LINK HERE****" data-binding="href=DetailsURL" target="_blank">
My question: using the R programming language, is it possible to save every link on this page that is contained in an <a href = .... target="_blank"> element?
For example, I tried this code:
library(rvest)
library(httr)
library(XML)
url<-"https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434,-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"
# making http request
resource <- GET(url)
# converting all the data to HTML format
parse <- htmlParse(resource)
# scraping the href attribute of every <a> tag
links <- xpathSApply(parse, path="//a", xmlGetAttr, "href")
page <-read_html(links)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
But I am not sure how to complete this.
Can someone please show me what to do next?
Thank you!
CodePudding user response:
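Even though the search parameters are encoded in the URL, it is better to call the site's API directly. Everything after the # in that URL is a fragment, which is never sent to the server, so a plain GET() of it does not return the listings. Open the Network tab of your browser's developer tools and you will find the request the page makes to the API. The parameters that are encoded in your URL can be found in that request's Payload tab. With httr2 you can retrieve the same information as the site.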
library(tidyverse)
library(httr2)
content <- "https://api2.realtor.ca/Listing.svc/PropertySearch_Post" %>%
request() %>%
req_body_form(
ZoomLevel = 4,
LatitudeMax = '67.16743',
LongitudeMax = '-56.40166',
LatitudeMin = '-5.70993',
LongitudeMin = '-139.10674',
CurrentPage = 2,
Sort = '6-D',
PropertyTypeGroupID = 1,
PropertySearchTypeId = 1,
TransactionTypeId = 2,
Currency = 'CAD',
RecordsPerPage = 12,
ApplicationId = 1,
CultureId = 1,
Version = '7.0'
) %>%
req_headers('referer' = 'https://www.realtor.ca/') %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE)
content %>%
getElement('Results') %>%
as_tibble
# A tibble: 12 x 21
Id MlsNum~1 Publi~2 Build~3 Indiv~4 Prope~5 Busin~6 Land$~7 Posta~8 Relat~9 Statu~*
<chr> <chr> <chr> <chr> <list> <chr> <df[,0> <chr> <chr> <chr> <chr>
1 25049990 W5821020 "Rare ~ 3 <df> $1,275~ 30.2 x~ L9E1J1 /real-~ 1
2 25049994 W5821034 "The B~ 3 <df> $1,499~ 29.5 x~ L6M2Z8 /real-~ 1
3 25049980 N5821033 "Immac~ 3 <df> $999,0~ 20.01 ~ L4S2K9 /real-~ 1
4 25049978 N5821026 "Ravin~ 3 <df> $990,0~ 24.64 ~ L6B0G6 /real-~ 1
5 25049977 N5821022 "A Rea~ 2 <df> $599,9~ NA L3T4S3 /real-~ 1
6 25049976 N5821019 "**** ~ 4 <df> $1,468~ 40.03 ~ L3X2H9 /real-~ 1
7 25049973 E5821030 "7 Yea~ 4 <df> $1,799~ 151.71~ L0B1A0 /real-~ 1
8 25049971 E5821014 "This ~ 3 <df> $849,0~ 27.1 x~ M4J4C3 /real-~ 1
9 25049966 C5821039 "Brigh~ 1 <df> $568,8~ NA M2N0L2 /real-~ 1
10 25049967 C5821042 "Come ~ 1 <df> $599,0~ NA M5V0G8 /real-~ 1
11 25049965 C5821029 "Wow! ~ 1 <df> $514,9~ NA M3C1S5 /real-~ 1
12 25049963 C5821025 "High ~ 2 <df> $1,199~ NA M5C0A6 /real-~ 1
# ... with 26 more variables: Building$Bedrooms <chr>, $StoriesTotal <chr>, $Type <chr>,
# $Ammenities <chr>, Property$Type <chr>, $Address <df[,5]>, $Photo <list>,
# $Parking <list>, $ParkingSpaceTotal <chr>, $TypeId <chr>, $OwnershipType <chr>,
# $ConvertedPrice <chr>, $OwnershipTypeGroupIds <list>, $ParkingType <chr>,
# $PriceUnformattedValue <chr>, $AmmenitiesNearBy <chr>, PhotoChangeDateUTC <chr>,
# HasNewImageUpdate <lgl>, Distance <chr>, RelativeURLEn <chr>, RelativeURLFr <chr>,
# Media <list>, InsertedDateUTC <chr>, TimeOnRealtor <chr>, Tags <list>, ...
# i Use `colnames()` to see all variable names
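To get from this table to the full links asked for in the question, one option is to prepend the site root to the RelativeURLEn column listed in the output above. The following is only a minimal sketch: `results` is a small stand-in data frame (its two paths are copied from the desired links in the question) so that the snippet runs without calling the API, and `base_url` is an assumed site root, not something the API returns.

```r
# Stand-in for the `Results` element of the API response; the two paths
# are copied from the desired links listed in the question.
results <- data.frame(
  Id = c("25050003", "25050002"),
  RelativeURLEn = c(
    "/real-estate/25050003/lot-1-norcross-rd-duncan-west-duncan",
    "/real-estate/25050002/39-legacy-lane-hamilton-ancaster"
  ),
  stringsAsFactors = FALSE
)

# Assumed site root for the relative paths (hypothetical helper variable)
base_url <- "https://www.realtor.ca"

# Prepend the site root to each relative path to build absolute links
links <- paste0(base_url, results$RelativeURLEn)
print(links)
```

With the real response, the same idea would presumably be paste0("https://www.realtor.ca", content$Results$RelativeURLEn), assuming that column holds paths of the form shown in the stand-in. Changing CurrentPage in the request and repeating this step should collect the links from further pages.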