<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>
I'm doing web scraping from an online dictionary, and the main dictionary part has roughly the above structure.
How exactly do I get all those children? With the usual Selenium approach I see online, that is list_elem.find_elements(By.XPATH, ".//*"), I only get the "proper" element children, but not the textual ones (sorry if my word choice is off). Meaning I would like to have len(children) == 6 instead of len(children) == 4.
I would like to get all children for further analysis.
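For reference, here is a minimal sketch of the attempt described above. It assumes driver is an existing Selenium WebDriver that is already on the dictionary page, and that the <li> shown is the first such element on it.

from selenium.webdriver.common.by import By

# The dictionary entry shown above
list_elem = driver.find_element(By.TAG_NAME, "li")

# Only element children come back: the three <b> and the one <i>, so len == 4.
# The quoted translation text nodes are not returned.
children = list_elem.find_elements(By.XPATH, ".//*")
print(len(children))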
CodePudding user response:
Elements (*), comment(), text(), and processing-instruction() are all nodes.
To select all nodes:
.//node()
To ensure that it's only selecting * and text() nodes, you can add a predicate filter:
.//node()[self::* or self::text()]
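As a minimal sketch of that predicate filter in action, here it is run with lxml (an assumption on my part, since Selenium's find_elements can only return elements, as the next answer notes); the child axis ./node() is used rather than .//node() so that only the <li>'s direct children are counted.

from lxml import etree

snippet = ('<li><b>word</b><i>type</i><b>1.</b>'
           '"translation 1"<b>2.</b>"translation 2"</li>')
li = etree.fromstring(snippet)

# node() matches both elements and text nodes; the predicate drops
# comments and processing instructions.
children = li.xpath("./node()[self::* or self::text()]")
print(len(children))  # 6: four elements plus the two quoted text nodes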
CodePudding user response:
I'm not a Selenium expert, but I've read Stack Overflow answers where apparently knowledgeable people have asserted that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm pretty sure that's correct.
So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't, because although it's a valid XPath query, it returns text nodes rather than elements.
I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
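A minimal sketch of that combination, assuming driver is an already-configured Selenium WebDriver and that the dictionary entries are <li> elements as in the question (the URL below is a placeholder, not the real site): let Selenium render the page, then hand the markup to lxml, whose xpath() can return text nodes as well as elements.

from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/dictionary")  # placeholder URL

# Parse the rendered page with lxml so XPath can return text nodes too.
doc = html.fromstring(driver.page_source)
for li in doc.xpath("//li"):
    children = li.xpath("./node()[self::* or self::text()]")
    print(len(children), children)

driver.quit()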