Selecting an element with multiple classes in python lxml using xpath

Time:12-18

I was trying to scrape a website using Python's requests and lxml. I could easily select elements with a single class using html.xpath(), but I can't figure out how to select elements with multiple classes.

I used some code like this to select the elements in page with class "title":

page.xpath('//a[@class="title"]')

However, I couldn't select elements with multiple classes. I looked at a few examples and tried to study XPath, but it seems like lxml.html.xpath() works differently, or maybe it's my lack of understanding. I tried a few expressions that didn't work for me. They are given below.

HTML code

<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" class="info text-center" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong ><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>

Test 1:

page.xpath('//a[@class="info text-center"]')

Test 2:

page.xpath("//a[@class='info text-center']")

Test 3:

page.xpath('//a[@]')

Test 4:

page.xpath("//a[contains(@class, 'info') and contains(@class, 'text-center')]")

I ran a couple more tests too, but I forgot to save the code. It would be great to know how to select elements with multiple classes using lxml.html.xpath().

CodePudding user response:

NB: as far as XPath is concerned, the class attribute's value is a string like any other. XPath doesn't automatically parse the value as a list of space-delimited tokens, as a CSS selector would. XPath 3.1 adds the contains-token() function, but lxml supports only XPath 1.0, in which you basically have to tokenize the class value yourself.
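One common XPath 1.0 idiom for "tokenizing the class value yourself" is to pad the normalized class string with spaces and then search for the padded token. The sketch below illustrates that idiom; the sample markup and the has_class helper are hypothetical, made up for illustration, and are not part of lxml or of the page in the question.

```python
from lxml import html

# Hypothetical sample markup, not the page from the question.
doc = html.fromstring(
    '<div>'
    '<a class="info text-center" href="/a">exact</a>'
    '<a class="text-center  info extra" href="/b">reordered</a>'
    '<a class="info-box text-center" href="/c">lookalike</a>'
    '</div>'
)


def has_class(name):
    """Build an XPath 1.0 predicate matching `name` as a whole class token.

    normalize-space() collapses runs of whitespace, and the surrounding
    spaces ensure that only complete tokens match.
    """
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


expr = f"//a[{has_class('info')}][{has_class('text-center')}]"
hrefs = [a.get("href") for a in doc.xpath(expr)]
print(hrefs)  # ['/a', '/b']
```

The third link is skipped because "info-box" is not the token "info", even though it contains that substring.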

If your class values are literally info text-center then you can test for that with the predicate [@class="info text-center"], but that won't match a class value of e.g. text-center info or info text-center foo bar. I'd recommend you use the XPath contains() function, e.g.

//a[contains(@class, "info")][contains(@class, "text-center")]
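To make the difference concrete, here is a small sketch comparing the exact-match predicate with the chained contains() predicates. The markup is a hypothetical example built from the class values mentioned above, not the real page.

```python
from lxml import html

# Hypothetical markup covering the three class values discussed above.
doc = html.fromstring(
    '<div>'
    '<a class="info text-center" href="/a">one</a>'
    '<a class="text-center info" href="/b">two</a>'
    '<a class="info text-center foo bar" href="/c">three</a>'
    '</div>'
)

# Exact string comparison: only the literal value "info text-center" matches.
exact = doc.xpath('//a[@class="info text-center"]/@href')

# Chained contains(): class order and extra classes no longer matter.
loose = doc.xpath(
    '//a[contains(@class, "info")][contains(@class, "text-center")]/@href'
)

print(exact)  # ['/a']
print(loose)  # ['/a', '/b', '/c']
```

Keep in mind that contains() is a plain substring test, so it would also match a class name such as info-box.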

CodePudding user response:

Your Test 1 and Test 2 should both work fine. This is the code I used to get the results:

from lxml import etree

# Parse the anchor from the question (class attribute restored: "info text-center").
root = etree.fromstring('<a href="https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-" class="info text-center" title="SKIN1004 Madagascar Centella Ampoule 30ml"> <strong ><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004</font></font></strong><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">SKIN1004 Madagascar Centella Ampoule 30ml</font></font></a>')
# Exact match on the full class string, as in Test 2.
elem = root.xpath("//a[@class='info text-center']")[0]
# Read the href attribute relative to the matched element.
url = elem.xpath('./@href')[0]
print(elem, url)

OUTPUT:

<Element a at 0x1ef01509940> https://www.lovemycosmetic.de/skin1004-madagascar-centella-ampoule-30ml-