rvest - select only certain hrefs under class

Time:10-27

Objective

Scrape a vector of paths to retail store location pages, while ignoring the hyperlinked telephone numbers. I am new to working with HTML elements.

What I have tried

library(rvest)
library(tidyverse)
library(xml2)

store.paths <- read_html("https://www.walmart.com/store/directory/al/alabaster") %>%
    html_nodes(xpath = '//*[@class="store-directory-container"]') %>% 
    html_nodes("a") %>% 
    html_attr('href') 

which yields

[1] "/store/4756"      "tel:205-624-6229" "/store/423"       "tel:205-620-0360"

while my desired output is

[1] "/store/4756"  "/store/423"

I have tried replacing store-directory-container with storeBanner and the result is empty.

Thanks!

CodePudding user response:

It looks like the a tags you want also have the class storeBanner while the telephone links do not. It would be easy to grab them with

store.paths <- read_html("https://www.walmart.com/store/directory/al/alabaster") %>%
  html_elements("a.storeBanner") %>% 
  html_attr('href') 

I used CSS selector syntax here because it is simpler, and the recommended html_elements() function because html_nodes() has been superseded since rvest 1.0.0. You can't simply replace "store-directory-container" with "storeBanner": the "a" tags are descendants of the "store-directory-container" element, whereas with "storeBanner" the "a" tag is itself the element carrying that class, not a child of it, so a following html_nodes("a") step finds nothing.
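To make the element-versus-descendant distinction concrete, here is a minimal sketch that runs offline. The markup is hypothetical, modelled only on the hrefs shown in the question, so the real page's structure may differ.

```r
library(rvest)

# Hypothetical markup (an assumption, not copied from the real page):
# the store links carry class "storeBanner", the telephone links do not.
page <- minimal_html('
  <div class="store-directory-container">
    <a class="storeBanner" href="/store/4756">Store 4756</a>
    <a href="tel:205-624-6229">205-624-6229</a>
    <a class="storeBanner" href="/store/423">Store 423</a>
    <a href="tel:205-620-0360">205-620-0360</a>
  </div>
')

# The <a> elements themselves have the class, so select them directly.
direct <- page %>%
  html_elements("a.storeBanner") %>%
  html_attr("href")

# Searching for <a> elements inside the storeBanner elements finds nothing,
# because those elements are the links and contain no <a> descendants.
nested <- page %>%
  html_elements(".storeBanner") %>%
  html_elements("a") %>%
  html_attr("href")

print(direct)
print(nested)
```

Under this assumed markup, direct holds the two store paths and nested is empty, which mirrors the empty result described in the question.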

CodePudding user response:

You can add one more XPath step that selects storeBanner after the "a" step:

store.paths <- read_html("https://www.walmart.com/store/directory/al/alabaster") %>%
  html_nodes(xpath = '//*[@class="store-directory-container"]') %>%
  html_nodes("a") %>%
  html_nodes(xpath = '//*[@class="storeBanner"]') %>%
  html_attr('href') 

store.paths

[1] "/store/4756" "/store/423" 
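If the class names on the page ever change, another option (not taken from either answer above) is to select on the href value itself. The sketch below again uses hypothetical markup modelled on the question's output. It shows a CSS attribute-prefix selector and, as a fallback, a plain filter over all hrefs.

```r
library(rvest)

# Hypothetical markup (an assumption): store links and telephone links mixed.
page <- minimal_html('
  <div class="store-directory-container">
    <a class="storeBanner" href="/store/4756">Store 4756</a>
    <a href="tel:205-624-6229">205-624-6229</a>
    <a class="storeBanner" href="/store/423">Store 423</a>
    <a href="tel:205-620-0360">205-620-0360</a>
  </div>
')

# Option 1: CSS attribute selector, keep only links whose href starts with "/store/".
by_selector <- page %>%
  html_elements('a[href^="/store/"]') %>%
  html_attr("href")

# Option 2: collect every href, then filter in R.
# The is.na() guard drops <a> elements that have no href attribute.
all_hrefs <- page %>%
  html_elements("a") %>%
  html_attr("href")
by_filter <- all_hrefs[!is.na(all_hrefs) & startsWith(all_hrefs, "/store/")]

print(by_selector)
print(by_filter)
```

Both options depend only on the shape of the link target, so they would also pick up any other link under /store/ on the page. Whether that is desirable depends on the real markup.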