Optimal way to extract multiple elements from an object in a pipeline (%>%)

Time:12-24

I already have a solution that works, but I would like to know if there is a better way to achieve the desired result.

library(tidyverse)
library(httr2)
library(rvest)

element <- "https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC" %>% 
  request() %>%  
  req_perform() %>% 
  resp_body_html() %>% 
  html_elements(".list:nth-child(13) label")

I would like to extract both the text and the "for" attribute of each element in the same pipeline. Individually, you would do:

element %>% 
  html_text2()

element %>% 
  html_attr("for")

tibble(
  area = element %>% 
    html_text2(), 
  area_code = element %>% 
    html_attr("for")
)

# A tibble: 11 × 2
   area                         area_code       
   <chr>                        <chr>           
 1 Agder (1 748)                location-0.22042
 2 Innlandet (1 967)            location-0.22034
 3 Møre og Romsdal (1 314)      location-0.20015
 4 Nordland (974)               location-0.20018
 5 Oslo (1 622)                 location-0.20061
 6 Rogaland (2 632)             location-0.20012
 7 Troms og Finnmark (1 167)    location-0.22054
 8 Trøndelag (2 334)            location-0.20016
 9 Vestfold og Telemark (2 016) location-0.22038
10 Vestland (2 595)             location-0.22046
11 Viken (7 198)                location-0.22030

One way I manage to do it in one pipeline (is this correct to say?) is as follows:

"https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC" %>% 
  request() %>%  
  req_perform() %>% 
  resp_body_html() %>% 
  html_elements(".list:nth-child(13) label") %>% 
  map_dfr(~ tibble(
    area = .x %>% html_text2(), 
    area_code = .x %>% html_attr("for")
  ))

This yields an average computation time of roughly 0.17-0.2 seconds. I would like to optimize the code, since I will be scraping a lot of pages from this site. Is there a better way to achieve this?

I benchmarked with the tictoc package.
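
For comparison, here is a minimal sketch of the same extraction in a single pipeline without map_dfr(). It assumes, as the tibble() call earlier already does, that html_text2() and html_attr() accept a whole node set at once. The HTML snippet and the name page are made-up stand-ins for the real response, built with rvest's minimal_html() so that nothing is downloaded. The braces make magrittr expose the node set as "." instead of inserting it as the first argument. This variant is not benchmarked here.

```r
library(rvest)   # also re-exports %>%
library(tibble)

# Hypothetical stand-in for the downloaded page (two labels only)
page <- minimal_html('
  <ul class="list">
    <li><label for="location-0.22042">Agder (1 748)</label></li>
    <li><label for="location-0.20061">Oslo (1 622)</label></li>
  </ul>')

result <- page %>%
  html_elements(".list label") %>%
  {
    # "." is the whole node set; both helpers are applied to it in one call each
    tibble(
      area      = html_text2(.),
      area_code = html_attr(., "for")
    )
  }

print(result)
```

Because each helper is called once on the full node set, no per-element tibble is built and nothing has to be row-bound afterwards.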

CodePudding user response:

Since you already have a list, mapping each item to a tibble is overkill and computationally expensive. You could instead use a function from the apply family (for example lapply) on the list and create the data frame afterwards; note that the rbind step in the code below returns a character matrix. In the benchmark below, this is approximately 3 times faster.

library(tidyr)
library(httr2)
library(rvest)
library(purrr)
library(microbenchmark)

# I download the data once and store it in the object "elements" because I don't want to DDoS the website
"https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC" %>% 
  request() %>%  
  req_perform() %>% 
  resp_body_html() %>% 
  html_elements(".list:nth-child(13) label") -> elements

microbenchmark(

  # lapply
  elements %>% 
    lapply(., \(x) c("area" = html_text2(x),
                     "area_code" = html_attr(x,"for"))
    ) %>% 
    do.call(rbind,.)
  ,
  ## map
  elements %>% 
    map_dfr(~ tibble(
      area = .x %>% html_text2(), 
      area_code = .x %>% html_attr("for")
    ))
  ,times= 500L
)

expr                  min        lq     mean    median        uq      max neval cld
lapply + rbind   3.836963  3.999723  4.46171  4.116454  4.315466 15.41846   500  a
map_dfr         11.863318 12.375467 13.20117 12.653064 13.049080 23.48040   500   b
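
The lapply variant above ends with a character matrix rather than a data frame. A minimal sketch of the "create the dataframe afterwards" step follows. It assumes a tibble is the desired result and runs on a made-up local snippet in place of the real page. The names page, rows, mat and df are illustrative only.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for the downloaded page
page <- minimal_html('
  <label for="location-0.22042">Agder (1 748)</label>
  <label for="location-0.20061">Oslo (1 622)</label>')
elements <- html_elements(page, "label")

# One named character vector per node, as in the benchmarked lapply call
rows <- lapply(elements, \(x) c(area = html_text2(x),
                                area_code = html_attr(x, "for")))

# rbind() stacks the vectors into a character matrix with named columns
mat <- do.call(rbind, rows)

# Convert once at the end instead of building one tibble per node
df <- as_tibble(mat)
print(df)
```

The conversion happens a single time on the finished matrix, which keeps the per-element work limited to the two rvest calls.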

