The copied CSS selector from the browser returns a different result using BeautifulSoup4 in Python

Usually, when I want to scrape a particular piece of text from a website, I right-click the text and select Inspect. Then, in the HTML code, I find the text I am interested in and choose right-click -> 'Copy' -> 'Copy selector'.

Then I paste the copied string into soup.select('enter copied text here') and save the result to a variable. I can then apply text-stripping functions to get the key text I need.

Now for the situation I am working with, I want to get the total number of cars shown on this webpage in the header h1: carsales.com.au/cars/used/toyota/rav4/. As of now the number is 1712.

This is my code:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.carsales.com.au/cars/used/toyota/rav4/"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'}

res = requests.get(url, headers=headers)
res.raise_for_status()

# # prints entire website
# print(res.text)

# # if this gives 200, then you're good to go.
#print(res.status_code)

soup = bs(res.text, 'html.parser')

# # This one gets how many cars are available from the search link. 
# # This is the alternate way as the soup.select method is not working.
# header_h1 = soup.find_all('h1')
# print(header_h1) 


total_cars_element = soup.select('body > div.listing > div.container.listing-container.has-header-sticky > div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div')

print(total_cars_element)
# the above prints an empty list.

I really just want to know why this is not working. I understand there are other workarounds, as I have mentioned in the code above, but I really want to stick with the soup.select method.

Any insights are much appreciated! Thanks!

CodePudding user response：

The issue stems from the fact that the HTML fetched via Python is not the same as the one that gets generated in your browser. Try printing soup and see for yourself.

One particular tag, which is part of your query, is troublesome. In the browser, it looks like this:

<div class="container listing-container has-header-sticky">

but your Python code sees this instead:

<div class="container listing-container">

Change your selector to:

body > div.listing > div.container.listing-container > div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div

and you'll get the expected result.
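A quick way to locate the troublesome step yourself is to test the copied selector one prefix at a time against the HTML that requests actually returned. The following is only a sketch under some assumptions: `first_failing_step` is a hypothetical helper name, the inline `static_html` is a made-up stand-in for the fetched page (not the real carsales markup), and the helper assumes a simple chain of `>` child combinators like the one Chrome copies.

```python
from bs4 import BeautifulSoup


def first_failing_step(soup, selector):
    """Return the shortest prefix of a '>'-chained selector that matches nothing.

    Returns None if the full selector matches at least one element.
    Hypothetical helper; assumes the selector only uses '>' combinators.
    """
    steps = [step.strip() for step in selector.split(">")]
    for end in range(1, len(steps) + 1):
        prefix = " > ".join(steps[:end])
        if not soup.select(prefix):
            return prefix
    return None


# Made-up stand-in for the HTML that requests receives (no JavaScript has run,
# so the 'has-header-sticky' class is absent).
static_html = """
<html><body>
  <div class="listing">
    <div class="container listing-container">
      <div class="row flex-nowrap no-gutters">
        <div><div><div><h1 class="title">1,712 cars</h1></div></div></div>
      </div>
    </div>
  </div>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# Selector as copied from the browser's dev tools.
copied = (
    "body > div.listing > "
    "div.container.listing-container.has-header-sticky > "
    "div.row.flex-nowrap.no-gutters > div:nth-child(1) > div:nth-child(1) > div"
)
failing = first_failing_step(soup, copied)
print(failing)  # the prefix ending in '.has-header-sticky'

# Dropping the class that only exists in the browser makes the selector match.
fixed = copied.replace(".has-header-sticky", "")
print(first_failing_step(soup, fixed))  # None
print(soup.select(fixed)[0].get_text(strip=True))
```

With the real page you would build `soup` from `res.text` instead of `static_html`; the first prefix printed points at the element whose classes differ between the browser and the fetched HTML.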

This behaviour is considered normal since the page you're trying to scrape is dynamic. That means that JavaScript adds or removes certain parts of the original HTML page after the page loads.

If you want to scrape a dynamic web page using Python, you'll need something more than just Beautiful Soup. See https://scrapingant.com/blog/scrape-dynamic-website-with-python for more info on that subject.

CodePudding user response：

Building on @Janez Kuhar's nice answer, you could also use:

total_cars_element = soup.select('h1.title')
print(total_cars_element[0].text)

See the Beautiful Soup documentation for more about CSS selectors.
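To go from the heading element to the number itself, something like the following sketch may help. It uses `select_one`, which returns the first match or `None`, and a regular expression to pull out the leading count. The heading text here is a hypothetical example, since the real page's wording may differ, and `total_cars` is just an illustrative variable name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical heading markup; the real page's wording may differ.
html = '<h1 class="title">1,712 Toyota RAV4 cars for sale in Australia</h1>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.select_one("h1.title")  # first matching element, or None
total_cars = None
if heading is not None:
    # Grab the first run of digits, allowing thousands separators.
    match = re.search(r"\d[\d,]*", heading.get_text(strip=True))
    if match:
        total_cars = int(match.group().replace(",", ""))

print(total_cars)  # 1712
```

Checking for `None` before reading the text avoids an exception if the heading is missing from the fetched HTML.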