How to parse a raw HTML element in R or Python?

For instance, on this website: https://www.amazon.com/Lexani-LXUHP-207-All-Season-Radial-Tire-245/dp/B07FFH8F9V/

So I click "inspect" and find the element that I'm interested in:

<span id="productTitle" class="a-size-large product-title-word-break">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>

Here's the deal: I want to copy the entire element, not just the product title text "Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W". Can someone tell me how I can do this in BeautifulSoup or rvest?

I am learning Python and R, and I tried digging into it but couldn't get the raw result.

CodePudding user response：

Amazon is likely to serve a captcha page to scripted requests, but if you get past it you can get what you want with:

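import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/Lexani-LXUHP-207-All-Season-Radial-Tire-245/dp/B07FFH8F9V/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# find() returns the whole Tag (None if the id is missing); str() gives its raw HTML
the_entire_thing = soup.find(id='productTitle')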
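To see the difference between the raw element and its text without hitting Amazon at all, here is a minimal offline sketch. It parses markup based on the snippet in the question; the names `SAMPLE` and `raw_element` are made up for illustration, and it uses Python's built-in `html.parser` instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; no network request is made here.
SAMPLE = (
    '<span id="productTitle" >        Lexani LXUHP-207 Performance '
    'Radial Tire - 245/45R18 100W       </span>'
)


def raw_element(html, element_id):
    """Return the element's full markup as a string, or None if it is absent.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    tag = BeautifulSoup(html, "html.parser").find(id=element_id)
    # find() returns None when nothing matches, e.g. on a captcha page.
    return None if tag is None else str(tag)


raw = raw_element(SAMPLE, "productTitle")
print(raw)  # the whole <span ...>...</span>, tags included

# For comparison, get_text() drops the tags and keeps only the title text.
tag = BeautifulSoup(SAMPLE, "html.parser").find(id="productTitle")
print(tag.get_text(strip=True))

# A page without the element (such as a captcha page) yields None.
print(raw_element("<html><body>captcha</body></html>", "productTitle"))
```

Note that `str(tag)` re-serializes the parsed element, so insignificant details such as the space before `>` may not match the page source byte for byte.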

CodePudding user response：

In R you can just convert the node to a character vector:

library(rvest)
html <- minimal_html('<span id="productTitle" class="a-size-large product-title-word-break">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>')
node <- html_element(html, "#productTitle")
as.character(node)
#> [1] "<span id=\"productTitle\" class=\"a-size-large product-title-word-break\">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>"

Created on 2022-11-02 with reprex v2.0.2