How to scrape user reviews when each review has a separate div id?

Time:09-14

I'm trying to scrape reviews from team-bhp.com. However, I noticed that each user review has its own div id:

  1. The XPath has the form: //*[@id="post_message_4655182"]
  2. The HTML has the form: <div id="post_message_4655182">

I'm open to using any library, such as bs4 or lxml, as long as it is Python. My code:
import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'

path = '//*[@id="post_message_4657893"]'
response = requests.get(url)
byte_data = response.content
source_code = html.fromstring(byte_data)

tree = source_code.xpath(path) 

print(tree[0].text_content())

This gives the expected output, for example: Hi Hajaar, We recently closed a deal for a BMW X1. Here are a few things I would like to share: Bargain hard...
But here I have hard-coded one specific comment id. How can I extract all reviews from a single page?

CodePudding user response:

Adjust your XPath to use starts-with(), which matches every element whose id begins with the shared prefix "post_message_":

path = '//*[starts-with(@id,"post_message_")]'

Example

import requests
from lxml import html

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'

path = '//*[starts-with(@id,"post_message_")]'
source_code = html.fromstring(requests.get(url).content)

for e in source_code.xpath(path):
    print(e.text_content())
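
To see what the starts-with() expression selects without hitting the network, here is a minimal offline sketch. The SAMPLE markup and the extract_reviews helper are made up for illustration and only assume that each review sits in an element whose id starts with "post_message_"; the real page may be structured differently.

```python
from lxml import html

# Hypothetical stand-in for a forum page; the real markup may differ.
SAMPLE = """
<html><body>
  <div id="post_message_1">First review</div>
  <div id="post_message_2">Second <b>review</b></div>
  <div id="sidebar">Not a review</div>
</body></html>
"""


def extract_reviews(page_source):
    """Map each post_message_* id to the text of that element."""
    tree = html.fromstring(page_source)
    # starts-with() is an XPath 1.0 function, so lxml supports it directly.
    nodes = tree.xpath('//*[starts-with(@id, "post_message_")]')
    return {node.get("id"): node.text_content().strip() for node in nodes}


reviews = extract_reviews(SAMPLE)
for post_id, text in reviews.items():
    print(post_id, "->", text)
```

Keeping the id next to the text makes it easier to trace a review back to its post later. The same helper should work on response.content from requests, provided the page really uses this id pattern.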

Or, since the question is also tagged BeautifulSoup, use a CSS attribute selector, where ^= means "starts with":

import requests
from bs4 import BeautifulSoup

url = 'https://www.team-bhp.com/forum/luxury-imports-niche/213083-looking-buying-bmw-x1-need-advice.html'
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for e in soup.select('[id^="post_message_"]'):
    print(e.get_text())
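
The selector can be tried offline in the same way. This is a rough sketch, not a definitive implementation: the SAMPLE markup, the PREFIX constant and the extract_posts helper are invented for illustration, and it assumes the numeric part of the id is what identifies a post.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a forum page; the real markup may differ.
SAMPLE = """
<html><body>
  <div id="post_message_1">First review</div>
  <div id="post_message_2">Second <b>review</b></div>
  <div id="sidebar">Not a review</div>
</body></html>
"""

PREFIX = "post_message_"


def extract_posts(page_source):
    """Return (post number, text) pairs for every post_message_* element."""
    soup = BeautifulSoup(page_source, "html.parser")
    posts = []
    # [id^="..."] selects elements whose id attribute starts with the prefix.
    for element in soup.select(f'[id^="{PREFIX}"]'):
        post_number = element["id"][len(PREFIX):]
        # Join nested text fragments with a space and trim surrounding whitespace.
        posts.append((post_number, element.get_text(" ", strip=True)))
    return posts


posts = extract_posts(SAMPLE)
for post_number, text in posts:
    print(post_number, text)
```

Passing a separator and strip=True to get_text() avoids words running together when a review contains nested tags or line breaks.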