BeautifulSoup: getting child of div container

Time:10-07

I am trying to get the price and odometer reading for cars listed on a car sales site, in order to monitor when a specific model was listed and when it disappeared. A page may return one or many cars. I am new to both Python and BeautifulSoup, and have most likely bitten off more than I can chew.

I managed to request the page, and find the div containers, each with details for one car.

I can iterate through the list of cars, but cannot address/extract subsequent tags for each car.

# import libraries
from bs4 import BeautifulSoup
import requests
# Request to website and download HTML contents
url = 'https://www.carsales.com.au/cars/2011/mercedes-benz/s-class/s350-badge/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}

response = requests.get(url, headers=headers)
response_code = response.status_code

if response_code != 200:
    print(f"Error fetching page: {response_code}")
    exit()
else:
    content = response.content

soup = BeautifulSoup(content, 'html.parser')

# <div class="card-body">
SELECTOR_CAR = "card-body"

# <a class="js-encode-search" data-webm-clickvalue="sv-price" href="/cars/details/2011-mercedes-benz-s-class-s350-auto-my10/OAG-AD-19752647/?Cr=8">$40,990* <span class="currency"></span></a>
SELECTOR_PRICE = ""

# <ul class="key-details">
#   <li class="key-details__value" data-type="Odometer">95,121 km</li>
SELECTOR_ODO = ""

# find all cars on page
# class is a Python reserved word; use class_ instead
cars = soup.find_all(class_ = SELECTOR_CAR)

# ----- my original version
formatted_cars = []     # array for car details

for car in cars:
    print("==========")
    data = {
        'title': car('js-encode-search'),
        'price': car('key-details__value')
    }
    formatted_cars.append(data)
    #car_soup = BeautifulSoup(car, 'html.parser')
    #print(car_card.prettify)
    #print(car_card)

print(formatted_cars)
# ----- end original

# ----- modified later
for car in cars:
    print("==========")
    for child in car.a.children:
        print(child)

    car_odo = car.li.contents
    print(car_odo)
# ----- modified later end

The modified version of the for loop results in:

python3 getCarsales_S350.py 
9 Mercedes-Benz S-Class S350 cars for sale in Australia
9
==========
2009 Mercedes-Benz S-Class S350 Auto MY08
['181,150 km']
==========
2010 Mercedes-Benz S-Class S350 Auto MY10
['291,153 km']
==========
2010 Mercedes-Benz S-Class S350 Auto MY10
['192,851 km']
==========
2010 Mercedes-Benz S-Class S350 Auto MY10
['78,606 km']
==========
2010 Mercedes-Benz S-Class S350 Auto MY10
['38,806 km']
==========
2010 Mercedes-Benz S-Class S350 Auto MY10
['172,012 km']
==========
2010 Mercedes-Benz S-Class S350 L Auto MY10
['77,800 km']
==========
2010 Mercedes-Benz S-Class S350 Auto MY10
['143,000 km']
==========
2011 Mercedes-Benz S-Class S350 Auto MY10
['95,121 km']

... which works by accident rather than by design, as shown by my being unable to get the price: the title and odometer just happen to be the first <a> and <li> elements in each container.

Here is a single car container:

<div class="card-body">
    <div class="row">
        <div class="col">
            <h3>
                <a class="js-encode-search" data-webm-clickvalue="sv-title"
                    href="/cars/details/2011-mercedes-benz-s-class-s350-auto-my10/OAG-AD-19752647/?Cr=8">2011
                    Mercedes-Benz S-Class S350 Auto MY10</a>
            </h3>
        </div>
        <div class="col-12 col-xl-5 text-right">
            <div class="item-price">
                <div class="price">
                    <a class="js-encode-search" data-webm-clickvalue="sv-price"
                        href="/cars/details/2011-mercedes-benz-s-class-s350-auto-my10/OAG-AD-19752647/?Cr=8">$40,990*
                        <span class="currency"></span></a>
                </div>
                <div class="price-info-container">
                    <a class="price-info" data-target-url="/_details/api/v1/price-guide/carsales/OAG-AD-19752647"
                        data-toggle="lightbox" data-webm-clickvalue="sv-price-label">
                        Excl. Govt. Charges
                    </a>
                    <a class="additional-price-info"
                        data-target-url="/_details/api/v1/price-guide/carsales/OAG-AD-19752647"
                        data-toggle="lightbox"></a>
                </div>
            </div>
        </div>
    </div>
    <div class="row">
        <div class="col">
            <ul class="key-details">
                <li class="key-details__value" data-type="Odometer">95,121 km</li>
                <li class="key-details__value" data-type="Body Style">Sedan</li>
                <li class="key-details__value" data-type="Transmission">Automatic</li>
                <li class="key-details__value" data-type="Engine">6cyl 3.5L Petrol</li>
            </ul>
            <a class="xfacts-report" data-lightbox-height="650" data-lightbox-onclosed="onFactsPlusModalClosed"
                data-lightbox-width="900" data-opm-event="click-facts-driver-listings"
                data-opm-exp="facts-driver-listings" data-opm-trackon="click" data-seller-type="dealer"
                data-smart-buyer-network-id="OAG-AD-19752647"
                data-target-url="/smartbuyer/popup?networkId=OAG-AD-19752647&amp;sourcesystem=desktop.carsales-dealer.listing-carfacts.buy.textlink&amp;driver_crosssell=desktop.carsales-dealer.listing-carfacts.buy.textlink"
                data-toggle="lightbox" data-webm-clickvalue="get-carfacts-report">
                Pricing &amp; history on this car - FACTS 
            </a>
        </div>
        <div class="col-12 col-xl-4 text-right d-flex align-items-start badge-csn">
        </div>
    </div>
</div>

CodePudding user response:

What happens

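car('js-encode-search') is shorthand for car.find_all('js-encode-search'), and the first argument of find_all() is a tag name, not a class, so it looks for <js-encode-search> tags and finds none. Even when searching by class, there are multiple tags with class js-encode-search (the title link and the price link), so the class alone does not identify either one.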
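
As a minimal sketch of this behaviour, the snippet below runs the same lookups against an abbreviated copy of the car container from the question. The html string is a trimmed stand-in (hrefs replaced with "#"), not the live page, and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for one car container from the question.
html = """
<div class="card-body">
  <h3><a class="js-encode-search" data-webm-clickvalue="sv-title" href="#">2011 Mercedes-Benz S-Class S350 Auto MY10</a></h3>
  <div class="price"><a class="js-encode-search" data-webm-clickvalue="sv-price" href="#">$40,990* <span class="currency"></span></a></div>
  <ul class="key-details">
    <li class="key-details__value" data-type="Odometer">95,121 km</li>
    <li class="key-details__value" data-type="Body Style">Sedan</li>
  </ul>
</div>
"""
car = BeautifulSoup(html, "html.parser").find(class_="card-body")

# Calling a tag is shorthand for find_all(); the first argument is a tag NAME.
by_name = car("js-encode-search")

# Searching by class matches both the title link and the price link.
by_class = car.find_all(class_="js-encode-search")

# Attribute-style access returns only the first matching descendant.
first_link = car.a
first_item = car.li

print(by_name)
print(len(by_class))
print(first_link.get_text(strip=True))
print(first_item.get_text(strip=True))
```

This is also why car.a and car.li appeared to work in the question: they return the first <a> and the first <li>, which happen to be the title and the odometer.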

How to fix

Make your selector more specific: the title is in the <a> inside the <h3>, and the price is in the <a> inside <div class="price">.

soup.select_one('h3 a')

Example

soup = BeautifulSoup(content, 'html.parser')

formatted_cars = []     # array for car details

for car in cars:
    print("==========")
    data = {
        'title': ' '.join(soup.select_one('h3 a').get_text(strip=True).split()),
        'price': soup.select_one('div.price a').get_text(strip=True)
    }
    formatted_cars.append(data)

print(formatted_cars)

Output

==========
[{'title': '2011 Mercedes-Benz S-Class S350 Auto MY10', 'price': '$40,990*'}]
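
Another way to make the selectors specific, sketched here as an alternative rather than a required change, is to match on the data-webm-clickvalue attribute that distinguishes the two links in the question's HTML. The html string is an abbreviated stand-in for one container, and it assumes the site keeps using the values sv-title and sv-price.

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for one car container; the title keeps the line
# break seen in the question's HTML to show whitespace normalisation.
html = """
<div class="card-body">
  <h3><a class="js-encode-search" data-webm-clickvalue="sv-title" href="#">2011
      Mercedes-Benz S-Class S350 Auto MY10</a></h3>
  <div class="price"><a class="js-encode-search" data-webm-clickvalue="sv-price" href="#">$40,990*
      <span class="currency"></span></a></div>
</div>
"""
car = BeautifulSoup(html, "html.parser").select_one("div.card-body")

# CSS attribute selectors pick each link by its role, not its position.
title_link = car.select_one('a[data-webm-clickvalue="sv-title"]')
price_link = car.select_one('a[data-webm-clickvalue="sv-price"]')

# split() + join() collapses the newline and indentation inside the title.
title = " ".join(title_link.get_text().split())
price = price_link.get_text(strip=True)

print({"title": title, "price": price})
```

Selecting by attribute keeps working if the layout around the links changes, but it breaks if the attribute values change, so either approach is a trade-off.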

CodePudding user response:

The selected answer is correct for one car. To get all the cars, the for loop needs to look like this:

formatted_cars = []     # array for car details

for car in cars:
    print("==========")
    data = {
        'title': ' '.join(car.select_one('h3 a').get_text(strip=True).split()),
        'price': car.select_one('div.price a').get_text(strip=True),
        'odo': car.select_one('ul.key-details li').get_text(strip=True)
    }
    #print(data)
    formatted_cars.append(data)

print(formatted_cars)

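The selectors are called on car (each element of cars), not on soup. soup.select_one() searches the whole page and returns the first match every time, so every iteration would report the first car.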
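
The difference can be seen in a small self-contained sketch with two containers. The html string is made up for illustration: the first car's price is invented, and its list items are deliberately reordered to show that selecting the odometer by its data-type attribute does not depend on it being the first <li>.

```python
from bs4 import BeautifulSoup

# Two abbreviated containers in the shape shown in the question.
# The first price is invented and the first <ul> is reordered on purpose.
html = """
<div class="card-body">
  <h3><a href="#">2010 Mercedes-Benz S-Class S350 Auto MY10</a></h3>
  <div class="price"><a href="#">$35,000*</a></div>
  <ul class="key-details">
    <li class="key-details__value" data-type="Body Style">Sedan</li>
    <li class="key-details__value" data-type="Odometer">143,000 km</li>
  </ul>
</div>
<div class="card-body">
  <h3><a href="#">2011 Mercedes-Benz S-Class S350 Auto MY10</a></h3>
  <div class="price"><a href="#">$40,990*</a></div>
  <ul class="key-details">
    <li class="key-details__value" data-type="Odometer">95,121 km</li>
    <li class="key-details__value" data-type="Body Style">Sedan</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
cars = soup.find_all(class_="card-body")

# Searching from soup starts at the top of the page on every iteration,
# so this repeats the first car's title once per car.
titles_from_soup = [soup.select_one("h3 a").get_text(strip=True) for _ in cars]

# Searching from each car stays inside that car's container.
formatted_cars = []
for car in cars:
    formatted_cars.append({
        "title": car.select_one("h3 a").get_text(strip=True),
        "price": car.select_one("div.price a").get_text(strip=True),
        "odo": car.select_one('li[data-type="Odometer"]').get_text(strip=True),
    })

print(titles_from_soup)
print(formatted_cars)
```

On a real page, select_one() returns None when a listing lacks one of these elements, so a check before calling get_text() may be needed.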
