Using XPath in R to scrape data from a website with multiple similar paths

I'm trying to scrape, in R, a list of apartments for sale and their basic info (address, m2, price, rooms, etc.) from this website: https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000 (see also the screenshot of the page inspector below)

Using SelectorGadget, I haven't been able to create a path that uniquely extracts the square meters of all 50 apartments on page 1, or another path that uniquely extracts the number of rooms, etc.

I did manage to find a path that uniquely extracts the addresses (see the code block below), but these sit in a separate block/class from the rest of the text.

Here is my current code:

library(rvest)
library(dplyr)

link = "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000&page=1"
page = read_html(link)
address = page %>% html_nodes("div.mr-2") %>% html_text()
price = NA  # MISSING - CAN'T FIGURE OUT
sqm = NA    # MISSING - CAN'T FIGURE OUT
rooms = NA  # MISSING - CAN'T FIGURE OUT
forsale = data.frame(address, price, sqm, rooms, stringsAsFactors = FALSE)

Any ideas on how to approach this? I also tried using XPath to extract the sqm, but only managed to extract one specific text field, not all 50 on the page.

Alternative approaches are welcome too. Thanks in advance!

CodePudding user response：

Using their API (found in the Network tab of the browser's developer tools), you can call it directly and retrieve the information like this:

library(tidyverse)
library(httr2)

"https://api.prod.bs-aws-stage.com/search/cases?addressTypes=condo&priceMax=7000000&priceMin=3000000&per_page=100&page=1&sortAscending=true&sortBy=timeOnMarket" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  pluck("cases") %>%
  unnest(address, names_sep = "_") %>%
  mutate(
    address = str_c(address_roadName, address_houseNumber, address_zipCode, sep = " "),
    .before = 1
  ) %>%
  select(address,
         price = priceCash,
         sqm = housingArea,
         rooms = numberOfRooms)

# A tibble: 100 × 4
   address                       price   sqm rooms
   <chr>                         <int> <int> <int>
 1 Holsteinsgade 66 2100       3135000    56     2
 2 Tuborgvej 60 2900           4875000   114     4
 3 Poppellunden 8 4000         3350000    92     3
 4 Hyldegårds Tværvej 5 2920   6498000   115     3
 5 Grollowstræde 3 3000        3495000    92     3
 6 Rasmus Rasks Vej 8 2500     3995000    80     3
 7 Ryesgade 7 8000             4598000   110     4
 8 Carl Th. Zahles Gade 8 2300 5795000   113     3
 9 Strandlodsvej 23E 2300      5495000   101     3
10 Nordre Fasanvej 162 2000    4695000    90     4
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
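
The call above returns a single page of up to 100 results. Below is a minimal sketch of how several pages might be fetched, assuming the API keeps accepting the `page` and `per_page` parameters seen in the URL above and returns an empty `cases` element past the last page. Both are assumptions, since the API is undocumented. `build_cases_request()` and `fetch_cases()` are hypothetical helper names, not part of httr2.

```r
library(httr2)
library(purrr)
library(dplyr)

# Hypothetical helper: build (but do not send) the request for one result page.
# Prices are passed as strings so they are never written in scientific notation.
build_cases_request <- function(page, per_page = 100) {
  request("https://api.prod.bs-aws-stage.com/search/cases") |>
    req_url_query(
      addressTypes  = "condo",
      priceMin      = "3000000",
      priceMax      = "7000000",
      per_page      = as.character(per_page),
      page          = as.character(page),
      sortAscending = "true",
      sortBy        = "timeOnMarket"
    )
}

# Hypothetical helper: perform the request and keep the columns of interest.
fetch_cases <- function(page) {
  cases <- build_cases_request(page) |>
    req_perform() |>
    resp_body_json(simplifyVector = TRUE) |>
    pluck("cases")

  # Assumed behaviour: nothing is returned once we are past the last page.
  if (length(cases) == 0) return(NULL)

  tibble(
    address = paste(cases$address$roadName,
                    cases$address$houseNumber,
                    cases$address$zipCode),
    price   = cases$priceCash,
    sqm     = cases$housingArea,
    rooms   = cases$numberOfRooms
  )
}
```

Calling `map(1:3, fetch_cases) |> bind_rows()` would then stack the first three pages into one tibble; `bind_rows()` silently skips any `NULL` returned for an empty page.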

CodePudding user response：

The selectors are somewhat convoluted and fragile, but for now they seem to work:

library(rvest)
library(dplyr)
library(purrr)
library(stringr)

url <- "https://www.boligsiden.dk/tilsalg/ejerlejlighed?sortAscending=true&priceMin=3000000&priceMax=7000000"
html <- read_html(url)
html |> html_elements("div.shadow.overflow-hidden.mx-4") |>
  map_dfr(\(x)
    list( 
      "address" = html_element(x ,"div.mr-2") |> html_text2() |> str_squish(),
      "price"   = html_element(x ,"span.text-lg.pr-2") |> html_text(),
      "sqm"     = html_element(x ,"div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(1) .text-sm" ) |> html_text(),
      "rooms"   = html_element(x ,"div.hidden.grid-cols-5.grid-rows-2 > div:nth-child(4) .text-sm" ) |> html_text()
      )
    )
#> # A tibble: 50 × 4
#>    address                                           price         sqm    rooms 
#>    <chr>                                             <chr>         <chr>  <chr> 
#>  1 Poppellunden 8, 4. tv. Himmelev, 4000 Roskilde    3.350.000 kr. 92 m²  3 Vær.
#>  2 Tuborgvej 60, 2. th. 2900 Hellerup                4.875.000 kr. 114 m² 4 Vær.
#>  3 Hyldegårds Tværvej 5, st. tv. 2920 Charlottenlund 6.498.000 kr. 115 m² 3 Vær.
#>  4 Grollowstræde 3 3000 Helsingør                    3.495.000 kr. 92 m²  3 Vær.
#>  5 Ryesgade 7, 2. tv. 8000 Aarhus C                  4.598.000 kr. 110 m² 4 Vær.
#>  6 Carl Th. Zahles Gade 8, 2. tv. 2300 København S   5.795.000 kr. 113 m² 3 Vær.
#>  7 Rasmus Rasks Vej 8, 2. tv. 2500 Valby             3.995.000 kr. 80 m²  3 Vær.
#>  8 Strandlodsvej 23E, 1. mf. 2300 København S        5.495.000 kr. 101 m² 3 Vær.
#>  9 Nordre Fasanvej 162, 3. th. 2000 Frederiksberg    4.695.000 kr. 90 m²  4 Vær.
#> 10 Ringstedgade 17B, 1. th. 4000 Roskilde            5.395.000 kr. 137 m² 5 Vær.
#> # … with 40 more rows

Created on 2023-02-01 with reprex v2.0.2
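
Since the question asked about XPath, here is a rough sketch of the same per-listing idea with relative XPath expressions, plus a cleanup step that turns strings such as "3.350.000 kr." into integers. It runs against a small made-up HTML snippet, not the real page: the class names `card`, `addr`, `price` and `facts` are invented for illustration, and `to_int()` is a hypothetical helper. The real site would need its own selectors.

```r
library(rvest)
library(purrr)
library(dplyr)
library(stringr)

# Toy page standing in for the real listing; the class names are made up.
html <- minimal_html('
  <div class="card">
    <div class="addr">Tuborgvej 60, 2. th. 2900 Hellerup</div>
    <span class="price">4.875.000 kr.</span>
    <div class="facts"><div>114 m²</div><div>4 Vær.</div></div>
  </div>
  <div class="card">
    <div class="addr">Grollowstræde 3 3000 Helsingør</div>
    <span class="price">3.495.000 kr.</span>
    <div class="facts"><div>92 m²</div><div>3 Vær.</div></div>
  </div>
')

# Hypothetical helper: drop everything except ASCII digits, then convert.
to_int <- function(x) as.integer(str_remove_all(x, "[^0-9]"))

# Select one node per listing first ...
cards <- html_elements(html, xpath = "//div[@class='card']")

# ... then query each card with a relative XPath (note the leading ".").
listings <- map_dfr(cards, \(card) list(
  address = html_element(card, xpath = ".//div[@class='addr']") |>
    html_text2(),
  price   = html_element(card, xpath = ".//span[@class='price']") |>
    html_text2() |> to_int(),
  sqm     = html_element(card, xpath = ".//div[@class='facts']/div[1]") |>
    html_text2() |> to_int(),
  rooms   = html_element(card, xpath = ".//div[@class='facts']/div[2]") |>
    html_text2() |> to_int()
))
```

The leading `.//` keeps each expression scoped to the current card, so every field stays aligned with its own listing. Because `html_element()` returns exactly one result per input node, a listing that lacks a field yields `NA` instead of shifting the remaining rows.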