I'm trying to use R to scrape a list of apartments for sale, along with their basic info (address, m², price, rooms, etc.), from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000 (see also the screenshot of the page inspector below).
Using SelectorGadget, I haven't been able to create a path that uniquely extracts the square meters of all 50 apartments on page 1, or another path that uniquely extracts the number of rooms, etc.
I did manage to find a path that uniquely extracts the addresses (see the code block below), but the address sits in a separate block/class from the rest of the text.
Here is my current code:
library(rvest)
library(dplyr)
link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = #MISSING - CAN'T FIGURE OUT
sqm = #MISSING - CAN'T FIGURE OUT
rooms = #MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)
Any ideas on how to approach this? I also tried using XPath to extract the sqm, but only managed to extract one specific text field, not all 50 on the page.
Alternative approaches are welcome too. Thanks in advance!
CodePudding user response:
Using their API (found in the Network tab of the browser's developer tools), you can request the data directly and retrieve the information like this:
library(tidyverse)
library(httr2)
"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  unnest(address, names_sep = "_") %>%
  mutate(
    address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
    .before = 1
  ) %>%
  select(
    address,
    price = priceCash,
    sqm = housingArea,
    rooms = numberOfRooms
  )
# A tibble: 100 × 4
address price sqm rooms
<chr> <int> <int> <int>
1 Holsteinsgade 66 2100 3135000 56 2
2 Tuborgvej 60 2900 4875000 114 4
3 Poppellunden 8 4000 3350000 92 3
4 Hyldegårds Tværvej 5 2920 6498000 115 3
5 Grollowstræde 3 3000 3495000 92 3
6 Rasmus Rasks Vej 8 2500 3995000 80 3
7 Ryesgade 7 8000 4598000 110 4
8 Carl Th. Zahles Gade 8 2300 5795000 113 3
9 Strandlodsvej 23E 2300 5495000 101 3
10 Nordre Fasanvej 162 2000 4695000 90 4
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
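Since the URL above carries `page` and `per_page` parameters, further pages can presumably be requested by changing `page`. Here is a minimal sketch of that idea. The helper names `build_request()` and `fetch_page()` are hypothetical, and I'm assuming the API accepts the same query parameters for every page; I haven't checked how it reports the total number of pages.

```r
library(httr2)

# Endpoint taken from the URL above.
base_url <- "https://api.prod.bs-aws-stage.com/search/cases"

# Hypothetical helper: build (but do not send) the request for one page.
# Values are passed as strings so large numbers are not reformatted.
build_request <- function(page, per_page = 100) {
  request(base_url) |>
    req_url_query(
      addressTypes  = "condo",
      priceMin      = "3000000",
      priceMax      = "7000000",
      per_page      = as.character(per_page),
      page          = as.character(page),
      sortAscending = "true",
      sortBy        = "timeOnMarket"
    )
}

# Hypothetical helper: perform the request and return the "cases" element.
fetch_page <- function(page) {
  resp <- build_request(page) |> req_perform()
  resp_body_json(resp, simplifyVector = TRUE)[["cases"]]
}

# Usage (performs network calls, so it is left commented out):
# pages <- lapply(1:3, fetch_page)
```

Each element of `pages` would then go through the same `unnest()`/`select()` steps as the single page above.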
CodePudding user response:
The selectors are somewhat convoluted and fragile, but for now this seems to work:
library(rvest)
library(dplyr)
library(purrr)
library(stringr)
url <- "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000"
html <- read_html(url)
html |>
  html_elements("div.shadow.overflow-hidden.mx-4") |>
  map_dfr(\(x)
    list(
      "address" = html_element(x, "div.mr-2") |> html_text2() |> str_squish(),
      "price" = html_element(x, "span.text-lg.pr-2") |> html_text(),
      "sqm" = html_element(x, "div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(1) .text-sm") |> html_text(),
      "rooms" = html_element(x, "div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(4) .text-sm") |> html_text()
    )
  )
#> # A tibble: 50 × 4
#> address price sqm rooms
#> <chr> <chr> <chr> <chr>
#> 1 Poppellunden 8, 4. tv. Himmelev, 4000 Roskilde 3.350.000 kr. 92 m² 3 Vær.
#> 2 Tuborgvej 60, 2. th. 2900 Hellerup 4.875.000 kr. 114 m² 4 Vær.
#> 3 Hyldegårds Tværvej 5, st. tv. 2920 Charlottenlund 6.498.000 kr. 115 m² 3 Vær.
#> 4 Grollowstræde 3 3000 Helsingør 3.495.000 kr. 92 m² 3 Vær.
#> 5 Ryesgade 7, 2. tv. 8000 Aarhus C 4.598.000 kr. 110 m² 4 Vær.
#> 6 Carl Th. Zahles Gade 8, 2. tv. 2300 København S 5.795.000 kr. 113 m² 3 Vær.
#> 7 Rasmus Rasks Vej 8, 2. tv. 2500 Valby 3.995.000 kr. 80 m² 3 Vær.
#> 8 Strandlodsvej 23E, 1. mf. 2300 København S 5.495.000 kr. 101 m² 3 Vær.
#> 9 Nordre Fasanvej 162, 3. th. 2000 Frederiksberg 4.695.000 kr. 90 m² 4 Vær.
#> 10 Ringstedgade 17B, 1. th. 4000 Roskilde 5.395.000 kr. 137 m² 5 Vær.
#> # … with 40 more rows
Created on 2023-02-01 with reprex v2.0.2
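Note that the scraped price, sqm and rooms columns are character strings ("3.350.000 kr.", "92 m²", "3 Vær."). If you need numbers, one possible cleanup is to strip everything except the digits. This is a rough sketch using sample values copied from the output above; `to_number()` is a hypothetical helper, and it assumes all values are whole numbers.

```r
# Sample values copied from the scraped output above.
price <- c("3.350.000 kr.", "4.875.000 kr.")
sqm   <- c("92 m²", "114 m²")
rooms <- c("3 Vær.", "4 Vær.")

# Hypothetical helper: drop everything except ASCII digits, then convert.
# Assumes whole numbers; a decimal separator would be dropped as well.
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

cleaned <- data.frame(
  price = to_number(price),
  sqm   = to_number(sqm),
  rooms = to_number(rooms)
)
```

The explicit `[^0-9]` class is used so that the superscript in "m²" is removed rather than kept as a digit.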