Web scraping struggles with a class ending in == $0

Time:11-03

I am writing a Python script to scrape a website, and I get empty output when I try to select a specific class.

The block is:

<div class="prdt Product"> == $0
    ::before
    <!-- /cache: pl_class_46761{nULE0} -->
    <div>
        <h3 class="Title">...</h3>
        ... etc, the rest of items

And the .py is:

from bs4 import BeautifulSoup
import requests

baseurl = 'https://www.list_of_brands.php'

headers = {
        'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

r = requests.get('https://www.the_first_page_of_a_brand.html')
soup = BeautifulSoup(r.content, 'lxml')

productlist = soup.find_all('div', class_='prdt Product')

print(productlist)

And all that gets printed is []

I can't find where my error is. Could it be related to the == $0? It seems the script doesn't pick up the container properly.

Thank you!

CodePudding user response:

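The == $0 is not part of the page: Chrome DevTools appends it to whichever element is currently selected in the Elements panel, so it has no effect on scraping.

I believe the parser is the issue. When I run your code against https://www.maquillalia.com/apieu-m-406.html, I also get an empty list. Changing the parser to html.parser gives me one tag in productlist, but with a warning message; the warning goes away if I use html5lib (a separate package that has to be installed):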

soup = BeautifulSoup(r.content, 'html5lib')

With that change, it prints:

[<div ><!-- /cache: pl_class_46761{BPCwo} --><div><h3 ><a href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html">A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía</a></h3><div ><figure><a href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html"><img alt="A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía" border="0"  height="220" src="images/productos/thumbnails/a-pieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-1-46761_thumb_220x220.jpg" title="A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía" width="220"/></a></figure></div><div >Mascarilla anatómica de algodón con vitaminas y oligoelementos que hidratan y recuperan la piel dañada.

Con extracto de Sandía que hidrataa y cuida la piel.

 <a  href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html">Ver  </a></div><div ><div  data-price="1.90"><strong>1,90€</strong></div><div ><span  data-rating="5.00"><span  title="5"></span><span  title="4"></span><span  title="3"></span><span  title="2"></span><span  title="1"></span><span  style="width: 100%"></span></span><span >(3)</span></div><div ><!-- cache: pl_boton_46761{BPCwo} --><a  data-atribute="" data-href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html" data-id="46761" data-qty="6" href="javascript:void(0);" rel="nofollow" title="Comprar A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía">Comprar<span style="position: absolute;top: 0;left: 0;width:100%;height: 100%;"></span></a><!-- /cache: pl_boton_46761{BPCwo} --><a  data-cid="0" data-list="" data-login="0" data-pid="46761" href="javascript:void(0);"></a><div ><span>IVA Incl.</span><span>Precio por 100 Gr: 9,05€</span></div></div></div></div></div>]
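To check whether the parser really is what makes the difference, one option is to feed the same markup to each parser and compare how many containers are found. The sketch below is only an illustration: SAMPLE_HTML and count_matches are made-up names, and the markup is a small well-formed stand-in for the real page, so all installed parsers should agree here. On the real page you would pass r.content instead. Parsers that are not installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the downloaded page (not the real site markup).
SAMPLE_HTML = """
<html><body>
<div class="prdt Product"><h3 class="Title">Item A</h3></div>
<div class="prdt Product Agotado"><h3 class="Title">Item B</h3></div>
<div class="Other"><h3 class="Title">Item C</h3></div>
</body></html>
"""

def count_matches(html, parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: number of matching divs, or None if not installed}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            counts[parser] = None
            continue
        counts[parser] = len(soup.select("div.prdt.Product"))
    return counts

counts = count_matches(SAMPLE_HTML)
for parser, n in counts.items():
    print(parser, n)
```

If the counts differ between parsers on the real page, the markup is probably malformed in a way that each parser repairs differently.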

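By the way, to find any div that has both the prdt and Product classes, though not necessarily only those, you can pass a function to find_all (a dict such as {'class':'prdt', 'class':'Product'} would not work, because the repeated key keeps only 'Product'):

soup.find_all(lambda tag: tag.name == 'div' and {'prdt', 'Product'} <= set(tag.get('class', [])))

or, preferably,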

soup.select('div.prdt.Product')

Then divs with classes such as prdt Product Agotado, Product prdt, etc. will also be included.
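The difference between these ways of matching classes can be seen on a small example. This is a sketch on made-up markup (the four divs below are assumptions, not taken from the site). It compares the exact-string match from the question, the CSS selector, and a dict literal with a repeated 'class' key, which Python silently collapses to its last value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering the class combinations discussed above.
html = """
<div class="prdt Product">exact</div>
<div class="prdt Product Agotado">extra class</div>
<div class="Product prdt">reversed order</div>
<div class="Product">only Product</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A string with a space is compared against the whole class attribute value,
# so only the first div matches.
exact = soup.find_all("div", class_="prdt Product")

# The CSS selector requires both classes, in any order, and allows extra ones.
both = soup.select("div.prdt.Product")

# A dict literal cannot hold the same key twice: only 'Product' survives,
# so this matches every div that has the Product class.
collapsed = {'class': 'prdt', 'class': 'Product'}
only_product = soup.find_all("div", collapsed)

print(len(exact), len(both), len(only_product))
print(collapsed)
```

The selector form is usually the easier one to read when several classes must all be present.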
