XPath to the children as well as "text children"

Time:07-29

<li>
    <b>word</b>
    <i>type</i>
    <b>1.</b>
    "translation 1"
    <b>2.</b>
    "translation 2"           
</li>

I'm scraping an online dictionary, and the main dictionary entry has roughly the structure above.

How exactly do I get all of those children? With the usual Selenium approach I see online, list_elem.find_elements(By.XPATH, ".//*"), I only get the "proper" element children, not the text ones (sorry if my word choice is off). That is, I would like len(children) == 6 instead of len(children) == 4.

I would like to get all of the children for further analysis.

CodePudding user response:

In XPath, elements (*), comments (comment()), text (text()), and processing instructions (processing-instruction()) are all nodes.

To select all descendant nodes of the context node (use ./node() instead if you only want direct children):

.//node()

To restrict the selection to elements and text nodes, add a predicate filter:

.//node()[self::* or self::text()]
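The distinction between element nodes and text nodes can be seen without any XPath engine at all. The following is a minimal sketch using Python's standard-library xml.dom.minidom on the snippet from the question (assumed here to be well-formed XML; the names SNIPPET, elements and kept are only illustrative). It does not evaluate XPath; it just walks the direct children and checks their node types, which is the same filter that self::* or self::text() expresses.

```python
from xml.dom import Node, minidom

# The <li> from the question, treated as well-formed XML (an assumption).
SNIPPET = """<li>
    <b>word</b>
    <i>type</i>
    <b>1.</b>
    "translation 1"
    <b>2.</b>
    "translation 2"
</li>"""

li = minidom.parseString(SNIPPET).documentElement

# Every direct child node, including whitespace-only text between the tags.
all_children = list(li.childNodes)

# Element children only: this is what an element-only query sees.
elements = [n for n in all_children if n.nodeType == Node.ELEMENT_NODE]

# Elements plus text nodes that contain something other than whitespace.
kept = [
    n
    for n in all_children
    if n.nodeType == Node.ELEMENT_NODE
    or (n.nodeType == Node.TEXT_NODE and n.data.strip())
]

print(len(elements), len(kept))
```

Note that the indentation between the tags is itself made of text nodes, so a plain node() or text() test will also match whitespace-only nodes unless they are filtered out.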

CodePudding user response:

I'm not a Selenium expert, but I've read Stack Overflow answers in which apparently knowledgeable people assert that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm pretty sure that's correct.

So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't: although it's a valid XPath query, it returns text nodes rather than elements.

I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
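As a rough sketch of what that could look like with lxml (assuming lxml is installed, and parsing the snippet from the question as well-formed XML; SNIPPET, children and summary are illustrative names, not part of any API):

```python
from lxml import etree

# The <li> from the question, treated as well-formed XML (an assumption).
SNIPPET = """<li>
    <b>word</b>
    <i>type</i>
    <b>1.</b>
    "translation 1"
    <b>2.</b>
    "translation 2"
</li>"""

li = etree.fromstring(SNIPPET)

# Element-only query, comparable to what the Selenium call in the question returns.
elements_only = li.xpath("./*")

# Direct children that are elements or non-blank text nodes.
# normalize-space() drops the whitespace-only text between the tags.
children = li.xpath("./node()[self::* or self::text()[normalize-space()]]")

# lxml returns elements as element objects and text nodes as str subclasses.
summary = [
    item.strip() if isinstance(item, str) else item.tag
    for item in children
]

print(len(elements_only), len(children))
print(summary)
```

For a live page, one option would be to hand Selenium's driver.page_source to lxml.html.fromstring and run the same XPath there; that part is untested here and depends on how the real page is structured.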
