I'm trying to scrape reviews from team-bhp.com. However, I noticed that each user review has a separate div id:
- the XPath is of the form:
//*[@id="post_message_4655182"]
- the HTML is of the form:
<div id="post_message_4655182">
I'm open to using any library such as bs4 or lxml, but I'd prefer a Python solution. My code:
import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[@id="post_message_4657893"]'

# fetch the page and parse the raw bytes into an lxml HTML tree
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)

# select the single post by its hard-coded id and print its text
tree = source_code.xpath(path)
print(tree[0].text_content())
This gives the proper output, for example:
Hi Hajaar,
We recently closed a deal for a BMW X1. Here are a few things I would like to share:
Bargain hard...
But here I have hard-coded a specific comment id. How do I extract all reviews from a single page?
CodePudding user response:
Adjust your XPath and use starts-with() to achieve your goal:
path = '//*[starts-with(@id,"post_message_")]'
Example
import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[starts-with(@id,"post_message_")]'

source_code = html.fromstring(requests.get(url).content)

# iterate over every post body whose id starts with "post_message_"
for e in source_code.xpath(path):
    print(e.text_content())
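If you also want to keep track of which post each review came from, the matched elements carry their id attribute, so you can pair it with the text. A minimal sketch along those lines (the reviews dict is only an illustration, not part of the original answer):

import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
path = '//*[starts-with(@id,"post_message_")]'
source_code = html.fromstring(requests.get(url).content)

# map each post id (e.g. "post_message_4655182") to its review text,
# stripping the leading/trailing whitespace lxml keeps from the markup
reviews = {e.get('id'): e.text_content().strip() for e in source_code.xpath(path)}

for post_id, text in reviews.items():
    print(post_id, text[:80])  # preview the first 80 characters of each review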
Or, since the question is tagged with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# CSS attribute selector: match every element whose id starts with "post_message_"
for e in soup.select('[id^="post_message_"]'):
    print(e.get_text())
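If the raw text needs cleaning, get_text() also accepts a separator and a strip flag. A small variation of the same loop, assuming soup has been built as above (the reviews list name is only illustrative):

# collect cleaned review texts instead of printing them directly
reviews = [e.get_text(' ', strip=True) for e in soup.select('[id^="post_message_"]')]
print(len(reviews), 'reviews found on this page')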