I already have a solution that works, but I would like to know if there is a better way to achieve the desired result.
library(tidyverse)
library(httr2)
library(rvest)
# Store the matched <label> nodes so the snippets below can refer to `element`
element <- "https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC" %>%
  request() %>%
  req_perform() %>%
  resp_body_html() %>%
  html_elements(".list:nth-child(13) label")
I would like to extract both the text and the value of the "for" attribute in the same pipeline. Individually, you would do:
element %>%
html_text2()
element %>%
html_attr("for")
tibble(
area = element %>%
html_text2(),
area_code = element %>%
html_attr("for")
)
# A tibble: 11 × 2
   area                         area_code
   <chr>                        <chr>
 1 Agder (1 748)                location-0.22042
 2 Innlandet (1 967)            location-0.22034
 3 Møre og Romsdal (1 314)      location-0.20015
 4 Nordland (974)               location-0.20018
 5 Oslo (1 622)                 location-0.20061
 6 Rogaland (2 632)             location-0.20012
 7 Troms og Finnmark (1 167)    location-0.22054
 8 Trøndelag (2 334)            location-0.20016
 9 Vestfold og Telemark (2 016) location-0.22038
10 Vestland (2 595)             location-0.22046
11 Viken (7 198)                location-0.22030
One way I managed to do it in a single pipeline is as follows:
"https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC" %>%
  request() %>%
  req_perform() %>%
  resp_body_html() %>%
  html_elements(".list:nth-child(13) label") %>%
  # Build a one-row tibble per <label> node and row-bind them
  map_dfr(~ tibble(
    area = .x %>% html_text2(),
    area_code = .x %>% html_attr("for")
  ))
This takes about 0.17 to 0.2 seconds on average. I would like to optimize the code, since I will be scraping a lot of pages from this site. Is there a better way to achieve this?
I benchmarked with the tictoc package.
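For reference, a timing like this can be taken roughly as sketched below. This is only an illustration: scrape_once() is a hypothetical stand-in that sleeps instead of running the real pipeline, and the number of runs is arbitrary.

```r
library(tictoc)

# Hypothetical stand-in for the scraping pipeline being timed
scrape_once <- function() {
  Sys.sleep(0.05)
  invisible(NULL)
}

n_runs <- 5
elapsed <- numeric(n_runs)

for (i in seq_len(n_runs)) {
  tic()
  scrape_once()
  timing <- toc(quiet = TRUE)            # list holding the start and stop times
  elapsed[i] <- timing$toc - timing$tic  # elapsed seconds for this run
}

mean(elapsed)  # average time per run, in seconds
```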
Answer:
Since you already have a list, mapping each item to its own tibble is overkill and computationally expensive. You could instead use an apply-family function (for example lapply) on the list and combine the results afterwards. In the benchmark below this is approximately 3 times faster.
library(tidyr)
library(httr2)
library(rvest)
library(purrr)
library(microbenchmark)
# Download the page once and store the nodes in `elements`,
# so the benchmark does not hit the website on every iteration
"https://www.finn.no/realestate/homes/search.html?sort=PUBLISHED_DESC" %>%
  request() %>%
  req_perform() %>%
  resp_body_html() %>%
  html_elements(".list:nth-child(13) label") -> elements
microbenchmark(
  # lapply: one named character vector per node, row-bound into a matrix
  elements %>%
    lapply(., \(x) c("area" = html_text2(x),
                     "area_code" = html_attr(x, "for"))
    ) %>%
    do.call(rbind, .)
  ,
  # map: one tibble per node, row-bound into a tibble
  elements %>%
    map_dfr(~ tibble(
      area = .x %>% html_text2(),
      area_code = .x %>% html_attr("for")
    ))
  ,
  times = 500L
)
      min        lq     mean    median        uq      max neval cld
 3.836963  3.999723  4.46171  4.116454  4.315466 15.41846   500   a
11.863318 12.375467 13.20117 12.653064 13.049080 23.48040   500   b
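Another option, which I have not benchmarked here, is to skip the per-element loop entirely: html_text2() and html_attr() both accept a whole set of nodes and return one value per node. Below is a minimal sketch of that idea. The HTML snippet and the helper labels_to_tibble() are made up for illustration so that it runs without touching the site; with the real page you would keep the original selector.

```r
library(rvest)
library(tibble)

# Made-up stand-in for the downloaded page, so the sketch runs offline
page <- minimal_html('
  <ul class="list">
    <li><label for="location-0.22042">Agder (1 748)</label></li>
    <li><label for="location-0.20061">Oslo (1 622)</label></li>
  </ul>
')

# Hypothetical helper: both extractors are applied to the whole node set at once
labels_to_tibble <- function(nodes) {
  tibble(
    area = html_text2(nodes),
    area_code = html_attr(nodes, "for")
  )
}

result <- page %>%
  html_elements(".list label") %>%
  labels_to_tibble()

result
```

Wrapping the two calls in a small function keeps everything in one pipeline and builds a single tibble, instead of one tibble (or one character vector) per label.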