Python bs4 HTML content with no meaning

Time:09-26

I have this function that scrapes product listings from a second-hand marketplace website.

Sometimes the function fails because the HTML it receives is different. For some products the HTML is fine, but for other listings on the same page the HTML is completely different, as if it had not loaded, even though my browser shows that both products have a similar HTML structure.

import requests
from bs4 import BeautifulSoup


def cj_search(location='portugal', search_term='gtx 1080'):
    headers = {'user-agent': 'Mozilla/5.0'}
    page_num = 1
    while True:
        page = requests.get(f'https://www.custojusto.pt/{location}/q/{search_term}?o={page_num}&sp=1&st=a', headers=headers)
        soup = BeautifulSoup(page.text, 'lxml')
        products = soup.find_all('div', class_='container_related')

        if not products:
            break

        for product in products:
            print(product)
            # Get the data
            product_name = product.find('h2', class_='title_related').find('b').text
            product_price = float(product.find('h5', class_='price_related').text.strip()[:-2])
            product_link = product.find('a')['href']

            page_num = page_num + 1
            print(f'Name: {product_name}\nPrice: {product_price}€\nLink: {product_link}\n')

Most of the time I get meaningful HTML like this:

<div class="container_related" data-cf-modified-49b0b69ad2ce6b4b7dae67af-="" onclick="...">
<a data-cf-modified-49b0b69ad2ce6b4b7dae67af-="" data-name="url" href="https://www.custojusto.pt/acores/informatica/informatica-acessorios/portatil-asus-gaming-34466267" id="34466267" onclick="if (!window.__cfRLUnblockHandlers) return false; javascript:window.event.preventDefault();">
<input name="list_id" type="hidden" value="34466267"/>
<div class="row results results_listing">
...
<div class="col-md-10 col-xs-7">
<h2 class="no-padding no-margin col-md-10 col-sm-10 words-all title_related" style="text-align:left;">
<b>Portátil ASUS Gaming</b>
</h2>
<h5 class="col-md-2 col-sm-2 col-xs-6 no-padding text-right pull-right price_related" style="right: 0!important;bottom: -2px;">
<span class="glyphicon glyphicon-arrow-down"></span>
500 €
</h5>
<div class="col-xs-12 no-margin description_related visible-xs">
Informática &amp; Acessórios
...

Other times I get this HTML, which often occurs in the middle or at the end of the product listings:

<div class="container_related" data-engageya="ENGAGEYA2">
<div id="rcjsload_605436"></div>
</div>

EDIT (tried Selenium)

To create the soup I also tried the following, but the same HTML didn't load:

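from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())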
driver.get(f'https://www.custojusto.pt/{location}/q/{search_term}?o={page_num}&sp=1&st=a')
sleep(10)
soup = BeautifulSoup(driver.page_source, 'lxml')

CodePudding user response:

The id of the second div, rcjsload, suggests that this is JavaScript-generated content. bs4 lets you parse HTML, but it does not run JavaScript, so it won't be of any help when the content you need is generated by a script.

Did you try using Selenium? It is a very powerful module that runs a browser for you. You basically remote-control a browser with code, and you can use it for testing as well as for web scraping. This lets you interact with the page itself as if you were using a browser.

EDIT:

I got it wrong: there is in fact no JS-generated content in that div, but you can just skip it if necessary. You can indeed do this with BeautifulSoup by matching a partial id or class with a regex. So import re and change your function. The key is adding a check for the div that is causing the issue:

product.find('div', attrs={'id': re.compile('^rcjsload_.*')})
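To see what that check does in isolation, here is a minimal sketch run against two trimmed-down containers modelled on the fragments in the question. The names HTML, PLACEHOLDER_ID and is_placeholder are made up for this illustration, and it uses the built-in html.parser instead of lxml only so that it runs without an extra dependency.

```python
import re

from bs4 import BeautifulSoup

# Two trimmed-down containers modelled on the fragments in the question:
# the first is a real listing, the second is the empty placeholder.
HTML = """
<div class="container_related">
  <a href="https://www.custojusto.pt/acores/informatica/informatica-acessorios/portatil-asus-gaming-34466267">
    <h2 class="title_related"><b>Portátil ASUS Gaming</b></h2>
  </a>
</div>
<div class="container_related" data-engageya="ENGAGEYA2">
  <div id="rcjsload_605436"></div>
</div>
"""

# Matches any id that starts with "rcjsload_" (hypothetical constant name)
PLACEHOLDER_ID = re.compile(r'^rcjsload_')


def is_placeholder(container):
    """Return True when the container wraps an rcjsload_* div instead of a listing."""
    return container.find('div', attrs={'id': PLACEHOLDER_ID}) is not None


soup = BeautifulSoup(HTML, 'html.parser')
containers = soup.find_all('div', class_='container_related')
flags = [is_placeholder(c) for c in containers]
print(flags)
```

Only the second container is flagged, so a loop can skip it with continue before it tries to read a title or price that is not there.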

import re

import requests
from bs4 import BeautifulSoup


def cj_search(location='portugal', search_term='gtx 1080'):
    headers = {'user-agent': 'Mozilla/5.0'}
    page_num = 1
    while True:
        page = requests.get(f'https://www.custojusto.pt/{location}/q/{search_term}?o={page_num}&sp=1&st=a', headers=headers)
        soup = BeautifulSoup(page.text, 'lxml')
        products = soup.find_all('div', class_='container_related')

        if not products:
            break

        for product in products:
            # Skip the placeholder containers that hold only an rcjsload_* div
            if product.find('div', attrs={'id': re.compile('^rcjsload_.*')}):
                print('Skipping this div')
                continue

            try:
                product_name = product.find('h2', class_='title_related').find('b').text
                # Drop the currency sign here; the print below adds it back
                product_price = product.find('h5', class_='price_related').text.replace('€', '').strip()
                product_link = product.find('a')['href']
            except (AttributeError, TypeError, KeyError):
                # An expected tag or attribute is missing from this listing
                product_name = 'Unknown'
                product_price = 0
                product_link = 'Unknown'

            print(f'Name: {product_name}\nPrice: {product_price}€\nLink: {product_link}\n')

        # Move to the next results page once per page, not once per product
        page_num = page_num + 1
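The function above keeps the price as text, while the version in the question cut it with [:-2] and converted it to float. If you want a number, a small helper along these lines may be less brittle. This is only a sketch: parse_price is a hypothetical name, and it assumes Portuguese number formatting ('.' for thousands, ',' for decimals), which the single "500 €" sample in the question does not confirm.

```python
import re


def parse_price(text):
    """Return the first number in `text` as a float, or None if there is none.

    Assumption: '.' and spaces are thousands separators and ',' is the
    decimal separator, as in '1.200,50 €'.
    """
    # A digit, then any run of digits/dots/whitespace, then an optional ',dd' part
    match = re.search(r'\d[\d.\s]*(?:,\d+)?', text)
    if match is None:
        return None
    # Remove thousands separators and whitespace, then switch ',' to '.'
    number = re.sub(r'[.\s]', '', match.group(0)).replace(',', '.')
    return float(number)


print(parse_price('\n500 €\n'))
print(parse_price('no price here'))
```

Returning None for text without a number lets the caller decide what to print, instead of silently reporting a price of 0.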