I am writing a Python script to scrape a website, and I get empty output when I try to select a specific class.
The block is:
<div > == $0
::before
<!-- /cache: pl_class_46761{nULE0} -->
<div>
<h3 class="Title">...</div>
... etc, the rest of items
And the .py is:
from bs4 import BeautifulSoup
import requests
baseurl = 'https://www.list_of_brands.php'
headers = {
'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}
r = requests.get('https://www.the_first_page_of_a_brand.html', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_='prdt Product')
print(productlist)
And what I get printed is just []
I can't find where my error is. Could it be related to the == $0? It seems that the container is not being picked up properly.
Thank you!
CodePudding user response:
I believe your parser might be the issue. When I run your code with https://www.maquillalia.com/apieu-m-406.html, I also don't get anything until I change the parser to html.parser, which gives me one tag in productlist, but there is a warning message. The warning goes away if I use html5lib:
soup = BeautifulSoup(r.content, 'html5lib')
With the above change, it prints:
[<div ><!-- /cache: pl_class_46761{BPCwo} --><div><h3 ><a href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html">A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía</a></h3><div ><figure><a href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html"><img alt="A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía" border="0" height="220" src="images/productos/thumbnails/a-pieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-1-46761_thumb_220x220.jpg" title="A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía" width="220"/></a></figure></div><div >Mascarilla anatómica de algodón con vitaminas y oligoelementos que hidratan y recuperan la piel dañada.
Con extracto de Sandía que hidrataa y cuida la piel.
<a href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html">Ver </a></div><div ><div data-price="1.90"><strong>1,90€</strong></div><div ><span data-rating="5.00"><span title="5"></span><span title="4"></span><span title="3"></span><span title="2"></span><span title="1"></span><span style="width: 100%"></span></span><span >(3)</span></div><div ><!-- cache: pl_boton_46761{BPCwo} --><a data-atribute="" data-href="https://www.maquillalia.com/apieu-mascarilla-icing-sweet-bar-sheet-mask-sandia-p-46761.html" data-id="46761" data-qty="6" href="javascript:void(0);" rel="nofollow" title="Comprar A'pieu - Mascarilla Icing Sweet Bar sheet Mask - Sandía">Comprar<span style="position: absolute;top: 0;left: 0;width:100%;height: 100%;"></span></a><!-- /cache: pl_boton_46761{BPCwo} --><a data-cid="0" data-list="" data-login="0" data-pid="46761" href="javascript:void(0);"></a><div ><span>IVA Incl.</span><span>Precio por 100 Gr: 9,05€</span></div></div></div></div></div>]
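If you want to check quickly whether the parser is what makes the difference, something like the sketch below may help. It parses the same markup with each parser and counts the matches. The helper name count_matches and the sample string are made up for illustration; in your script you would pass r.content instead. lxml and html5lib are optional packages, so the sketch records None for any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_matches(markup, selector, parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: number_of_matches}; None means the parser is not installed."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
            continue
        results[name] = len(soup.select(selector))
    return results


# Hypothetical stand-in for r.content from the real request.
sample = '<div class="prdt Product"><div><h3 class="Title">Item</h3></div></div>'
print(count_matches(sample, "div.prdt.Product"))
```

On a small, well-formed snippet like this one every installed parser should agree. The counts only diverge on markup that the parsers repair differently, which seems to be what happens on the real page.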
Btw, to find any div with both the prdt and Product classes, but not necessarily just those, you can use
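# a dict like {'class':'prdt', 'class':'Product'} keeps only the last key, so use a function filter instead
soup.find_all(lambda tag: tag.name == 'div' and {'prdt', 'Product'} <= set(tag.get('class', [])))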
or preferably
soup.select('div.prdt.Product')
and then divs with class prdt Product Agotado, Product prdt, etc. will also be included.
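To make the difference between these ways of matching concrete, here is a small sketch on made-up markup. The class names are the ones from the question; the single-letter contents and the texts helper are placeholders for illustration. It also shows one pitfall worth knowing: a dict literal with a repeated 'class' key silently keeps only the last value.

```python
from bs4 import BeautifulSoup

# Made-up markup: only the class names come from the question.
html = """
<div class="prdt Product">a</div>
<div class="prdt Product Agotado">b</div>
<div class="Product prdt">c</div>
<div class="Product">d</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Collect the text of each matched tag so the results are easy to compare."""
    return [t.get_text() for t in tags]


# A string containing a space is compared with the exact class attribute value.
exact = texts(soup.find_all("div", class_="prdt Product"))

# CSS selector: both classes required, in any order, extra classes allowed.
both = texts(soup.select("div.prdt.Product"))

# Function filter: same idea, written as a subset test on the tag's class list.
func = texts(soup.find_all(
    lambda tag: tag.name == "div"
    and {"prdt", "Product"} <= set(tag.get("class", []))
))

# Repeated dict key: Python keeps only 'Product', so this filters on one class.
dup = texts(soup.find_all("div", {"class": "prdt", "class": "Product"}))

print(exact)  # ['a']
print(both)   # ['a', 'b', 'c']
print(func)   # ['a', 'b', 'c']
print(dup)    # ['a', 'b', 'c', 'd']
```

The select form is probably the easiest to read, and it scales to more classes by chaining them in the selector.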