I am trying to scrape a website using Python requests and lxml. I can easily select elements that have a single class using html.xpath(),
but I can't figure out how to select elements that have multiple classes.
I used code like this to select the elements on the page with class "title":
page.xpath('//a[@class="title"]')
However, I couldn't select elements with multiple classes. I looked at a few examples and tried to study XPath, but it seems like lxml.html.xpath()
works differently, or maybe it's my lack of understanding. The few expressions I tried didn't work for me. They are given below.
HTML code
<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong ><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>
Test 1:
page.xpath('//a[@]')
Test 2:
page.xpath("//a[@class='info text-center']")
Test 3:
page.xpath('//a[@]')
Test 4:
page.xpath("//a[contains(@class, 'info') and contains(@class, 'text-center')]")
I ran a couple more tests too, but I forgot to save the code. It would be great to know how to select elements with multiple classes using lxml.html.xpath().
CodePudding user response:
NB: as far as XPath is concerned, the class attribute's value is a string like any other. It isn't automatically parsed as a list of space-delimited tokens, as a CSS selector would do. Later versions of XPath have the function contains-token(), but lxml supports XPath 1.0, in which you basically have to tokenize the class value yourself.
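Tokenizing "by hand" in XPath 1.0 is usually done by padding the normalized class string with spaces and searching for the padded token. Here is a minimal sketch of that idiom; the sample markup and the has_class() helper are made up for illustration and are not part of lxml:

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
doc = html.fromstring(
    '<div>'
    '<a class="info text-center" href="/a">A</a>'
    '<a class="text-center  info extra" href="/b">B</a>'
    '<a class="information text-centered" href="/c">C</a>'
    '</div>'
)

def has_class(name):
    """Build an XPath 1.0 test that matches `name` as a whole class token.

    Illustrative helper (not an lxml API): normalize-space() collapses runs
    of whitespace, and the surrounding spaces make the match token-exact.
    """
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"

# One token test per required class, combined with `and`.
expr = f"//a[{has_class('info')} and {has_class('text-center')}]"
hrefs = [a.get("href") for a in doc.xpath(expr)]
print(hrefs)
```

The first two links match regardless of class order or extra whitespace; the third does not, because "information" and "text-centered" are different tokens.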
If your class values are literally info text-center, then you can test for that with the predicate [@class="info text-center"], but that won't match a class value of e.g. text-center info or info text-center foo bar. I'd recommend you use the XPath contains() function, e.g.
//a[contains(@class, "info")][contains(@class, "text-center")]
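To see the difference between the two predicates, here is a small sketch that runs both against some made-up markup (the sample links are assumptions for illustration only). Note that contains() is a plain substring test, so it will also accept class names that merely include the text you are looking for:

```python
from lxml import html

# Hypothetical sample markup, invented for this illustration.
doc = html.fromstring(
    '<div>'
    '<a class="info text-center" href="/a">A</a>'
    '<a class="text-center info" href="/b">B</a>'
    '<a class="info text-center foo bar" href="/c">C</a>'
    '<a class="information text-centered" href="/d">D</a>'
    '</div>'
)

# Exact string comparison: only the literal value "info text-center" matches.
exact = doc.xpath('//a[@class="info text-center"]/@href')

# Chained contains(): order and extra classes no longer matter, but the
# substring test also lets "information text-centered" through.
chained = doc.xpath(
    '//a[contains(@class, "info")][contains(@class, "text-center")]/@href'
)

print(exact)
print(chained)
```

If class names on the page can be substrings of one another, a stricter token-based test may be worth the extra verbosity; otherwise the chained contains() form is usually enough.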
CodePudding user response:
Your test1 and test2 should both work fine. This is the code I used to get the results:
from lxml import etree
# The class attribute below is assumed from the question's tests ("info text-center"); it is missing from the HTML as posted.
root = etree.fromstring('<a class="info text-center" href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong ><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>')
elem = root.xpath("//a[@class='info text-center']")[0]
url = elem.xpath('./@href')[0]
print(elem, url)
OUTPUT:
<Element a at 0x1ef01509940> https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-