Apply rvest html_nodes() rowwise and store the output in a new column

Time:04-10

I have some URLs that I would like to scrape. I end up with three data frames, one per URL; for example:

# A tibble: 255 × 7
   id                   class                                                                                                                            tabindex role  `aria-controls` style `data-testid`
   <chr>                <chr>                                                                                                                            <chr>    <chr> <chr>           <chr> <chr>        
 1 App                  NA                                                                                                                               NA       NA    NA              NA    NA           
 2 NA                   re-AdTop1Container                                                                                                               NA       NA    NA              NA    NA           
 3 NA                   re-AdTop1Container-block                                                                                                         NA       NA    NA              NA    NA           
 4 NA                   re-AdvertisingDominanceCrossdevice-x65                                                                                           NA       NA    NA              NA    NA           
 5 PubX65Detail_wrapper adit-XandrBanner adit-XandrBanner--notAvailable                                                                                  NA       NA    NA              NA    NA           
 6 PubX65Detail         NA                                                                                                                               NA       NA    NA              NA    NA           
 7 NA                   re-AdvertisingDominanceCrossdevice-top1                                                                                          NA       NA    NA              NA    NA           
 8 PubTop1_wrapper      adit-XandrBanner adit-XandrBanner--notAvailable                                                                                  NA       NA    NA              NA    NA           
 9 PubTop1              NA                                                                                                                               NA       NA    NA              NA    NA           
10 NA                   react-MoleculeDrawer-content react-MoleculeDrawer-content--placement-left react-MoleculeDrawer-content--size-auto react-Molecul… NA       NA    NA              NA    NA

I want to go over each row of the class column and store the nodes collected for that class in a new column.

For example, I can collect the data for a single class manually using:

html3 %>% 
  html_nodes('.re-DetailHeader-propertyTitleContainer')

but I would like to preserve the structure of the data collected by rvest: I want to create a new column that holds the html_nodes() result for each class in the class column.

Code:

library(rvest)
library(dplyr)

url1 = "https://www.fotocasa.es/es/comprar/vivienda/madrid-capital/terraza-piscina/163103410/d"
url2 = "https://www.fotocasa.es/es/comprar/vivienda/elche---elx/calefaccion-terraza-ascensor-parking-internet-no-amueblado/162434119/d"
url3 = "https://www.fotocasa.es/es/comprar/vivienda/almoradi/terraza-trastero-ascensor-amueblado/163000099/d"



##### process url 1 #####
html1 = url1 %>% 
  read_html()


classAttrs_1 = html1 %>% 
  html_nodes('div') %>% 
  html_attrs() %>% 
  bind_rows() %>% 
  mutate_all(na_if,"")

########################

##### process url 2 #####
html2 = url2 %>% 
  read_html()


classAttrs_2 = html2 %>% 
  html_nodes('div') %>% 
  html_attrs() %>% 
  bind_rows() %>% 
  mutate_all(na_if,"")

########################

##### process url 3 #####
html3 = url3 %>% 
  read_html()


classAttrs_3 = html3 %>% 
  html_nodes('div') %>% 
  html_attrs() %>% 
  bind_rows() %>% 
  mutate_all(na_if,"")

########################
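The three blocks above repeat the same pipeline, so it could be wrapped in a function. This is only a sketch: div_attrs is a hypothetical helper name, the inline HTML is made-up test input, and across() is used in place of mutate_all() as an assumption about the installed dplyr version.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: collect the attributes of every <div> in a parsed page
# and turn empty strings into NA, as in the three blocks above.
div_attrs <- function(html) {
  html %>%
    html_nodes("div") %>%
    html_attrs() %>%
    bind_rows() %>%
    mutate(across(everything(), ~ na_if(.x, "")))
}

# Made-up inline page, so the sketch runs without network access
page <- minimal_html(
  '<div id="App"><div class="a b"></div><div class="c" style=""></div></div>'
)
attrs <- div_attrs(page)

# With the real pages this would be, for example:
# classAttrs_1 <- div_attrs(read_html(url1))
```

Each page still gets its own data frame, so the differing numbers of classes per URL are not a problem.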

The number of unique classes differs between the scraped pages, i.e.

> length(unique(classAttrs_1$class))
[1] 113
> length(unique(classAttrs_2$class))
[1] 114
> length(unique(classAttrs_3$class))
[1] 115

So I thought about treating each of the data frames individually.

Answer:

We can use rowwise(): for each row, check whether the value in 'class' is NA; if it is not, run html_nodes() with that class as the selector and store the result in a list column, otherwise store NA.

library(rvest)
library(dplyr)
library(stringr)
# one list element per row: NA when class is missing, otherwise the matching nodes
classAttrs_3_new <- classAttrs_3 %>%
  rowwise() %>%
  mutate(new = list(if (is.na(class)) NA else html3 %>%
    html_nodes(str_c(".", class)))) %>%
  ungroup()

-output

> head(classAttrs_3_new$new)
[[1]]
[1] NA

[[2]]
{xml_nodeset (1)}
[1] <div ><div >\n<div ><div id="PubX65Detail_wrapper" re-AdTop1Container-block">\n<div ><div id="PubX65Detail_wrapper" re-AdvertisingDominanceCrossdevice-x65"><div id="PubX65Detail_wrapper" ><div id="PubX65Detail"></div></div ...

[[5]]
{xml_nodeset (0)}

[[6]]
[1] NA

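Another option is purrr::map() with possibly(), which returns NA for any row where html_nodes() throws an error (for example when class is NA):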

library(purrr)
pfun_node <- possibly(function(html_obj, node_val ) 
      html_obj %>% html_nodes(node_val), otherwise = NA)
classAttrs_3$new <- map(str_c(".", classAttrs_3$class), ~ pfun_node(html3, .x))
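One likely reason for empty node sets such as [[5]] in the output above is a class value that holds several space-separated classes: prefixing only a single "." turns everything after the first space into a descendant tag selector. A minimal base R sketch, assuming the intent is to match elements carrying all of the listed classes (class_to_selector is a hypothetical helper, not part of rvest):

```r
# Hypothetical helper: turn a class attribute value into a CSS selector
# that requires every listed class, e.g. "a b" becomes ".a.b".
class_to_selector <- function(class_attr) {
  if (is.na(class_attr)) return(NA_character_)
  # split on whitespace, then prefix each class with "." and join them
  classes <- strsplit(trimws(class_attr), "\\s+")[[1]]
  paste0(".", classes, collapse = "")
}

single <- class_to_selector("re-AdTop1Container")
multi  <- class_to_selector("adit-XandrBanner adit-XandrBanner--notAvailable")
```

The result could then be passed to html_nodes() in place of str_c(".", class) in either approach above.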