Selecting an element with multiple classes in Python lxml using XPath

I was trying to scrape a website using Python requests and lxml. I can easily select elements that have a single class using html.xpath(), but I can't figure out how to select elements that have multiple classes.

I used some code like this to select the elements in page with class "title":

page.xpath('//a[@class="title"]')

However, I couldn't select elements with multiple classes. I looked at a few examples and tried to study XPath, but it seems like lxml.html.xpath() works differently, or maybe it's my lack of understanding. The attempts that didn't work for me are given below.

HTML code

<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-"  title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong ><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>

Test 1:

page.xpath('//a[@]')

Test 2:

page.xpath("//a[@class='info text-center']")

Test 3:

page.xpath('//a[@]')

Test 4:

page.xpath("//a[contains(@class, 'info') and contains(@class, 'text-center')]")

I did a couple more tests too, but I forgot to save the code. It would be great to know how to select elements with multiple classes using lxml.html.xpath().

CodePudding user response：

NB: as far as XPath is concerned, the class attribute's value is a string like any other. XPath doesn't automatically parse the value as a list of space-delimited tokens, as a CSS selector would. In later versions of XPath (3.1) you have the function contains-token(), but lxml supports only XPath 1.0, in which you basically have to tokenize the class value yourself.

If your class values are literally info text-center then you can test it with the predicate [@class='info text-center'], but that won't match a class value of e.g. text-center info or info text-center foo bar. I'd recommend you use the XPath contains() function (bearing in mind that it is a substring test, so it would also match a class such as info-box), e.g.

//a[contains(@class, "info")][contains(@class, "text-center")]
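To make the difference between these predicates concrete, here is a minimal sketch that compares the exact-match test, the contains() test above, and a common XPath 1.0 idiom that pads the class value with spaces so each class is matched as a whole token. The sample markup, the element ids, and the helper names has_class and ids are invented for illustration; they are not from the question.

```python
from lxml import html

# Hypothetical sample markup, invented for illustration only.
SAMPLE = """
<div>
  <a id="exact" class="info text-center" href="/a">A</a>
  <a id="reordered" class="text-center info" href="/b">B</a>
  <a id="extra" class="info  text-center foo" href="/c">C</a>
  <a id="lookalike" class="info-box text-center" href="/d">D</a>
</div>
"""

page = html.fromstring(SAMPLE)


def has_class(name):
    """Build an XPath 1.0 test that matches `name` as a whole class token.

    Hypothetical helper: it pads the normalized class value with spaces
    and looks for ' name ' inside it.
    """
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def ids(xpath):
    """Return the id attribute of every element the expression selects."""
    return [el.get("id") for el in page.xpath(xpath)]


# Exact string comparison: only the literal value "info text-center" matches.
exact = ids("//a[@class='info text-center']")

# Substring test: order-independent, but "info-box" also contains "info".
substring = ids("//a[contains(@class, 'info')][contains(@class, 'text-center')]")

# Token test: order-independent and limited to whole class names.
tokens = ids(f"//a[{has_class('info')}][{has_class('text-center')}]")

print(exact)      # ['exact']
print(substring)  # ['exact', 'reordered', 'extra', 'lookalike']
print(tokens)     # ['exact', 'reordered', 'extra']
```

The padded form is longer to write, but it is the closest XPath 1.0 gets to the behaviour of the CSS selector a.info.text-center.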

CodePudding user response：

Your Test 1 and Test 2 should both work fine; this is the code I used to get the results.

from lxml import etree

# The class attribute is missing from the snippet as posted; "info text-center" is assumed from Test 2.
root = etree.fromstring('<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" class="info text-center" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong ><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>')
# Exact-match predicate from Test 2, then read the href attribute of the first hit.
elem = root.xpath("//a[@class='info text-center']")[0]
url = elem.xpath('./@href')[0]
print(elem, url)

OUTPUT:

<Element a at 0x1ef01509940> https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-