How to parse raw html element in R or Python?

Time:11-02

For instance in this website: https://www.amazon.com/Lexani-LXUHP-207-All-Season-Radial-Tire-245/dp/B07FFH8F9V/

So I click "Inspect" and find the element that I'm interested in:

<span id="productTitle" >        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>

Here's the deal: I want to copy the entire element, tags included, not just the product title text "Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W". Can someone tell me how to do this in BeautifulSoup or rvest?

I am learning Python and R and tried digging into this, but couldn't get the raw result.

CodePudding user response:

Amazon may serve a captcha page instead of the product page, but if you get past that, you can get what you want with:

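import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/Lexani-LXUHP-207-All-Season-Radial-Tire-245/dp/B07FFH8F9V/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
tag = soup.find(id='productTitle')  # None if the element is missing, e.g. on a captcha page
the_entire_thing = str(tag) if tag is not None else None  # str() gives the raw HTML, tags included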
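
To try this without requesting anything from Amazon, here is a minimal sketch that parses a static copy of the span from the question. The variable names are only illustrative, and it assumes the built-in 'html.parser' backend so that lxml is not required. It shows the difference between the whole element, its inner HTML, and its text:

```python
from bs4 import BeautifulSoup

# Static copy of the element from the question, so no network request is needed.
snippet = '<span id="productTitle" >        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>'

# 'html.parser' is Python's built-in parser; 'lxml' also works if it is installed.
soup = BeautifulSoup(snippet, 'html.parser')
tag = soup.find(id='productTitle')

raw_html = str(tag)                 # the whole element, tags included
inner_html = tag.decode_contents()  # only what sits between the tags
title = tag.get_text(strip=True)    # only the text, surrounding whitespace trimmed

print(raw_html)
print(title)
```

The serialized element can differ cosmetically from the page source (for example, the stray space inside the opening tag is dropped), because BeautifulSoup writes the tag back out from its parsed tree rather than copying the original bytes.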

CodePudding user response:

In R you can just convert the node to a character vector:

library(rvest)
html <- minimal_html('<span id="productTitle" class="a-size-large product-title-word-break">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>')
html_node <- html_element(html, "#productTitle")
as.character(html_node)
#> [1] "<span id=\"productTitle\" class=\"a-size-large product-title-word-break\">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>"

Created on 2022-11-02 with reprex v2.0.2
