In R/rvest, how to get href information (the link behind the clickable text)

Time:06-16

In R with rvest, the code below works with html_text(), but when I try to get the link behind each piece of text, web %>% html_node("div.p13n-desktop-grid") %>% html_attr(name='href') fails. Can anyone help? Thanks!


library(rvest)
url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
web <- rvest::read_html(url)
web %>% html_node("div.p13n-desktop-grid") %>% html_text() %>% strsplit("#") # ok
web %>% html_node("div.p13n-desktop-grid") %>%  html_attr(name='href') # want the link behind each clickable text, but this fails

CodePudding user response:

The href attribute belongs to the a tags, not to the div. It is not clear which links you want; there are 119 href values inside that div:

web %>% 
  html_node("div.p13n-desktop-grid") %>% 
  html_elements("a") %>%
  html_attr(name = 'href') 
#   [1] "/Comgrow-Creality-Ender-Aluminum-220x220x250mm/dp/B07BR3F9N6/ref=zg_bs_6066127011_1/132-1194669-0063960?pd_rd_i=B07BR3F9N6&psc=1"                                
#   [2] "/Comgrow-Creality-Ender-Aluminum-220x220x250mm/dp/B07BR3F9N6/ref=zg_bs_6066127011_1/132-1194669-0063960?pd_rd_i=B07BR3F9N6&psc=1"                                
#   [3] "/product-reviews/B07BR3F9N6/ref=zg_bs_6066127011_cr_1/132-1194669-0063960?pd_rd_i=B07BR3F9N6"                                                                    
#   [4] ......
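
The difference between asking the div and asking its a tags for href can be reproduced offline. The following is a minimal sketch that assumes a made-up HTML fragment in place of the live Amazon page; the product paths and link texts are hypothetical, and only the class name comes from the question.

```r
library(rvest)

# Hypothetical stand-in for the Amazon grid; this markup is invented for illustration
page <- minimal_html('
  <div class="p13n-desktop-grid">
    <a href="/dp/AAA111/ref=x">First product</a>
    <a href="/dp/BBB222/ref=y">Second product</a>
  </div>')

grid <- page %>% html_element("div.p13n-desktop-grid")

# The div has no href attribute of its own, so html_attr() returns NA
div_href <- grid %>% html_attr("href")

# The a tags inside the div do carry href
a_href <- grid %>% html_elements("a") %>% html_attr("href")

# Optionally resolve the relative paths against the site root
full_url <- xml2::url_absolute(a_href, "https://www.amazon.com")
```

Here div_href is NA while a_href holds one relative path per link, which matches the behaviour described in the question.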

CodePudding user response:

For (shortened) product links and link texts:

library(rvest)
library(dplyr)

url <- "https://www.amazon.com/Best-Sellers-Industrial-Scientific-3D-Printers/zgbs/industrial/6066127011/ref=zg_bs_pg_1?_encoding=UTF8&pg=1"
web <- rvest::read_html(url)

# "div.p13n-desktop-grid a[tabindex] + a" : 
# text links are adjacent siblings of image links & image links have a tabindex attribute,
# so "+" (the adjacent-sibling combinator) selects the text link right after each image link

prod_links <- web %>% html_elements("div.p13n-desktop-grid a[tabindex] + a")
tibble(
  # shorten links, keep only the /dp/item_id/ part
  href =  prod_links %>% html_attr(name='href') %>% sub('.*(/dp/\\w*/).*','www.amazon.com\\1', .),
  descr = prod_links %>% html_text2()
)
#> # A tibble: 30 × 2
#>    href                          descr                                          
#>    <chr>                         <chr>                                          
#>  1 www.amazon.com/dp/B07BR3F9N6/ Official Creality Ender 3 3D Printer Fully Ope…
#>  2 www.amazon.com/dp/B07FFTHMMN/ Official Creality Ender 3 V2 3D Printer Upgrad…
#>  3 www.amazon.com/dp/B09QGTTQKG/ ANYCUBIC Kobra 3D Printer Auto Leveling, FDM 3…
#>  4 www.amazon.com/dp/B07GYRQVYV/ Official Creality Ender 3 Pro 3D Printer with …
#>  5 www.amazon.com/dp/B083GTS8XJ/ ANYCUBIC Wash and Cure Station, Newest Upgrade…
#>  6 www.amazon.com/dp/B09FXYSFBV/ ANYCUBIC Photon Mono 4K 3D Printer, 6.23'' Mon…
#>  7 www.amazon.com/dp/B07J9QGP7S/ ANYCUBIC Mega-S New Upgrade 3D Printer with Hi…
#>  8 www.amazon.com/dp/B07Z9C9T42/ ELEGOO 5PCs FEP Release Film Mars LCD 3D Print…
#>  9 www.amazon.com/dp/B08SPXYND4/ Voxelab Aquila 3D Printer with Full Alloy Fram…
#> 10 www.amazon.com/dp/B07DYL9B2S/ Official Creality Ender 3 S1 3D Printer with D…
#> # … with 20 more rows

Created on 2022-06-16 by the reprex package (v2.0.1)
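
The adjacent-sibling selection and the link shortening can be tried without hitting Amazon. This is a small sketch under the assumption that each image link (with tabindex) is immediately followed by a text link, as the answer describes; the HTML fragment, item ids and product names are hypothetical.

```r
library(rvest)

# Hypothetical markup: each image link (with tabindex) is followed by a text link
page <- minimal_html('
  <div class="p13n-desktop-grid">
    <div><a tabindex="-1" href="/Some-Name/dp/AAA111/ref=x?psc=1"><img alt=""></a><a href="/Some-Name/dp/AAA111/ref=x?psc=1">First product</a></div>
    <div><a tabindex="-1" href="/Other/dp/BBB222/ref=y?psc=1"><img alt=""></a><a href="/Other/dp/BBB222/ref=y?psc=1">Second product</a></div>
  </div>')

# "+" matches an <a> that directly follows an <a tabindex> sibling
text_links <- page %>% html_elements("div.p13n-desktop-grid a[tabindex] + a")

raw_href <- text_links %>% html_attr("href")

# Keep only the /dp/<item id>/ part and prefix the host
short_href <- sub(".*(/dp/\\w*/).*", "www.amazon.com\\1", raw_href)

# Visible link text, one entry per text link
descr <- text_links %>% html_text2()
```

Only the text links are selected, so short_href and descr line up one-to-one and can go straight into a tibble or data frame.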
