I have some URLs that I would like to scrape. I end up with three data frames, one per URL; for example:
# A tibble: 255 × 7
id class tabindex role `aria-controls` style `data-testid`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 App NA NA NA NA NA NA
2 NA re-AdTop1Container NA NA NA NA NA
3 NA re-AdTop1Container-block NA NA NA NA NA
4 NA re-AdvertisingDominanceCrossdevice-x65 NA NA NA NA NA
5 PubX65Detail_wrapper adit-XandrBanner adit-XandrBanner--notAvailable NA NA NA NA NA
6 PubX65Detail NA NA NA NA NA NA
7 NA re-AdvertisingDominanceCrossdevice-top1 NA NA NA NA NA
8 PubTop1_wrapper adit-XandrBanner adit-XandrBanner--notAvailable NA NA NA NA NA
9 PubTop1 NA NA NA NA NA NA
10 NA react-MoleculeDrawer-content react-MoleculeDrawer-content--placement-left react-MoleculeDrawer-content--size-auto react-Molecul… NA NA NA NA NA
I want to go over each row of the 'class' column and store the collected data in a new column.
For example, I can collect the data manually for a single class using:
html3 %>%
html_nodes('.re-DetailHeader-propertyTitleContainer')
but I would like to preserve the "structure" of the data collected with rvest. I want to create a new column that stores the result of html_nodes() for each class in the 'class' column.
Code:
library(rvest)
library(dplyr)
url1 = "https://www.fotocasa.es/es/comprar/vivienda/madrid-capital/terraza-piscina/163103410/d"
url2 = "https://www.fotocasa.es/es/comprar/vivienda/elche---elx/calefaccion-terraza-ascensor-parking-internet-no-amueblado/162434119/d"
url3 = "https://www.fotocasa.es/es/comprar/vivienda/almoradi/terraza-trastero-ascensor-amueblado/163000099/d"
##### process url 1 #####
html1 = url1 %>%
  read_html()
classAttrs_1 = html1 %>%
  html_nodes('div') %>%
  html_attrs() %>%
  bind_rows() %>%
  mutate_all(na_if, "")
########################
##### process url 2 #####
html2 = url2 %>%
  read_html()
classAttrs_2 = html2 %>%
  html_nodes('div') %>%
  html_attrs() %>%
  bind_rows() %>%
  mutate_all(na_if, "")
########################
##### process url 3 #####
html3 = url3 %>%
  read_html()
classAttrs_3 = html3 %>%
  html_nodes('div') %>%
  html_attrs() %>%
  bind_rows() %>%
  mutate_all(na_if, "")
########################
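Since the same steps are repeated for every URL, they could be wrapped in a small function. The following is only a sketch: div_attrs() and the inline demo page are hypothetical names introduced here so the example runs without network access, and across() is used in place of mutate_all() (both should give the same result for character columns).

```r
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical helper: collect the attributes of every <div> in a parsed page,
# one row per node, with empty strings turned into NA.
div_attrs <- function(html) {
  html %>%
    html_nodes("div") %>%
    html_attrs() %>%
    bind_rows() %>%
    mutate(across(everything(), ~ na_if(.x, "")))
}

# Small inline page (an assumption, not one of the real URLs) so the sketch runs offline.
demo_html <- read_html(
  '<div id="App"><div class="a"></div><div class="a b" style=""></div></div>'
)
demo_attrs <- div_attrs(demo_html)

# With the real pages, something like this would keep one data frame per URL:
# pages <- map(c(url1, url2, url3), read_html)
# attrs <- map(pages, div_attrs)
```

Keeping the parsed pages and the attribute tables in two parallel lists makes it easy to process each pair separately later on.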
The number of unique classes differs between the scraped URLs, e.g.
> length(unique(classAttrs_1$class))
[1] 113
> length(unique(classAttrs_2$class))
[1] 114
> length(unique(classAttrs_3$class))
[1] 115
So I thought about treating each data frame individually.
Answer:
We may use rowwise(), check whether the value in 'class' is non-NA, apply the code, and store the result in a list column (otherwise return NA):
library(rvest)
library(dplyr)
library(stringr)
classAttrs_3_new <- classAttrs_3 %>%
  rowwise() %>%
  mutate(new = list(if (is.na(class)) NA else html3 %>%
    html_nodes(str_c(".", class)))) %>%
  ungroup()
Output:
> head(classAttrs_3_new$new)
[[1]]
[1] NA
[[2]]
{xml_nodeset (1)}
[1] <div ><div >\n<div ><div id="PubX65Detail_wrapper" re-AdTop1Container-block">\n<div ><div id="PubX65Detail_wrapper" re-AdvertisingDominanceCrossdevice-x65"><div id="PubX65Detail_wrapper" ><div id="PubX65Detail"></div></div ...
[[5]]
{xml_nodeset (0)}
[[6]]
[1] NA
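Note that a 'class' value holding several space-separated classes (such as "adit-XandrBanner adit-XandrBanner--notAvailable") becomes a descendant selector once a single "." is prepended, which may be why some rows return an empty nodeset. A minimal sketch of one way around this, assuming the class names contain no characters that need CSS escaping; class_to_selector() and the demo page are hypothetical:

```r
library(rvest)
library(stringr)

# Hypothetical helper: turn a class attribute such as "a b" into the
# compound CSS selector ".a.b"; NA input stays NA.
class_to_selector <- function(class) {
  str_c(".", str_replace_all(str_squish(class), " ", "."))
}

# Inline demo page (an assumption) with one node carrying two classes.
demo_html <- read_html('<div class="a b"><div class="a"></div></div>')

# ".a b" means "a <b> element inside .a", so nothing matches here.
descendant_hits <- html_nodes(demo_html, ".a b")

# ".a.b" means "a node that has both classes", which matches the outer div.
compound_hits <- html_nodes(demo_html, class_to_selector("a b"))
```

In the rowwise() call, str_c(".", class) could then be swapped for class_to_selector(class).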
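Another option is map() from purrr, with the call wrapped in possibly() so that any selector that raises an error yields NA instead of stopping: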
library(purrr)
pfun_node <- possibly(function(html_obj, node_val)
  html_obj %>% html_nodes(node_val), otherwise = NA)
classAttrs_3$new <- map(str_c(".", classAttrs_3$class), ~ pfun_node(html3, .x))
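To treat each data frame individually, as the question suggests, the same idea could be wrapped in a function and applied to every (data frame, page) pair. This is a sketch only: add_nodes() and the demo objects are hypothetical, and the explicit NA check is added here so that rows without a class never reach html_nodes().

```r
library(rvest)
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical helper: add a list column 'new' holding the nodes matched
# by each row's class in the given parsed page.
add_nodes <- function(attrs, html) {
  # possibly() turns any selector error into NA instead of stopping the loop
  safe_nodes <- possibly(function(sel) html_nodes(html, sel), otherwise = NA)
  attrs %>%
    mutate(new = map(str_c(".", class),
                     ~ if (is.na(.x)) NA else safe_nodes(.x)))
}

# Inline demo page and attribute table (assumptions) so the sketch runs offline.
demo_html  <- read_html('<div id="App"><div class="a"></div><div class="a"></div></div>')
demo_attrs <- tibble(id = c("App", NA, NA), class = c(NA, "a", "a"))
demo_out   <- add_nodes(demo_attrs, demo_html)

# With the real objects, the three pairs could be processed in one call:
# results <- map2(list(classAttrs_1, classAttrs_2, classAttrs_3),
#                 list(html1, html2, html3), add_nodes)
```

Each element of results would then be the original data frame plus its own 'new' list column, so the differing numbers of classes per URL do not matter.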