How to get a tree of all XPaths in a website using Python?

Approach I

I am trying to get a hierarchical tree of all the XPaths in a website. I would expect the output to look something like this:


| /html
|-- //*[@id="browser-upgrade-notification"]
|-- //*[@id="app"]
|-- /html/head
|-- /html/body
|--/-- /html/body/noscript
|--/-- /html/body/div[2]
|--/-- /html/body/header/section
|--/--/-- /html/body/header/section/div
|--/--/--/-- /html/body/header/section/div/div[1]
....

This would be an example of the tree listing I am after.

CodePudding user response：

/html/body/ is not a valid XPath; use /html/body instead.
/html/body/div[6] matches a single element on that page, while /html/body/div[6]/* matches 3 elements (the children of that div).
//* will return you all the elements on the page.
In any case, driver.find_elements_by_xpath takes an XPath locator as its parameter and returns a list of the web elements that match it (in Selenium 4 the equivalent call is driver.find_elements(By.XPATH, ...)). It does not give you the XPaths of the nodes on the page.
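To illustrate the difference between matching elements and obtaining their paths, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than Selenium. The sample markup is a made-up stand-in for a page. ElementTree supports only a subset of XPath, and its paths are relative to the element they are called on, so ./body stands in for /html/body here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a parsed page (well-formed XML, not real-world HTML).
DOC = "<html><head/><body><div><p/><p/><span/></div><div/></body></html>"
root = ET.fromstring(DOC)  # root is the <html> element

# Each query returns a list of Element objects, never path strings.
single = root.findall("./body/div[1]")      # the first <div> under <body>
children = root.findall("./body/div[1]/*")  # the child elements of that <div>
everything = root.findall(".//*")           # every descendant of <html>

print(len(single), len(children), len(everything))  # 1 3 7
print([el.tag for el in children])                  # ['p', 'p', 'span']
```

Whatever the query, the result is a collection of nodes. The path of each node has to be derived separately, for example by walking the tree.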

CodePudding user response：

The total number of XPaths that select one or more elements is infinite (for example it will include paths like /a/b/../b/../b/../b), but if you restrict yourself to paths of the form /a[i]/b[j]/c[k] then the number of paths is equal to the number of elements, and the "tree" of XPaths is isomorphic with the original XML tree.
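A minimal sketch of that restricted form, assuming the document has already been parsed into an xml.etree.ElementTree element. The function name positional_paths and the sample document are made up for illustration. It emits exactly one /a[i]/b[j]/c[k] path per element, where each index counts same-named siblings.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def positional_paths(root):
    """Return one path of the form /a[i]/b[j]/c[k] per element, in document order."""
    paths = []

    def walk(element, prefix, index):
        path = f"{prefix}/{element.tag}[{index}]"
        paths.append(path)
        seen = Counter()  # how many siblings with each tag name we have met so far
        for child in element:
            seen[child.tag] += 1
            walk(child, path, seen[child.tag])

    walk(root, "", 1)  # the root element is always the first (and only) of its name
    return paths

doc = ET.fromstring("<a><b><c/><c/></b><b><d/></b></a>")
paths = positional_paths(doc)
for p in paths:
    print(p)
# /a[1]
# /a[1]/b[1]
# /a[1]/b[1]/c[1]
# /a[1]/b[1]/c[2]
# /a[1]/b[2]
# /a[1]/b[2]/d[1]
```

The number of paths equals the number of elements, and no two paths are the same, which is the one-to-one correspondence described above. The walk is recursive, so a very deeply nested document could hit Python's recursion limit.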

If you want the distinct paths without a numerical predicate, for example /a/b/c, /a/b/d, then the simplest approach is probably to walk the XML document, get the path for each element (in this form) and eliminate duplicates. If rather than a flat list of paths you want a tree structure, then build it up as you go using nested maps/dictionaries.
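The walk described above could be sketched as follows, again with xml.etree.ElementTree. The names build_path_tree and render, the sample document, and the "|--" output style are assumptions for illustration. One pass collects the distinct predicate-free paths in a set and builds the nested dictionary at the same time.

```python
import xml.etree.ElementTree as ET

def build_path_tree(root):
    """Return (sorted distinct paths, nested dict keyed by tag name)."""
    distinct = set()
    tree = {}

    def walk(element, prefix, node):
        path = f"{prefix}/{element.tag}"
        distinct.add(path)  # the set drops duplicates such as repeated /a/b
        child_node = node.setdefault(element.tag, {})  # same-named siblings share one entry
        for child in element:
            walk(child, path, child_node)

    walk(root, "", tree)
    return sorted(distinct), tree

def render(node, prefix="", depth=0):
    """Turn the nested dict into indented lines, one per distinct path."""
    lines = []
    for tag, children in node.items():
        path = f"{prefix}/{tag}"
        lines.append("|" + "--" * depth + " " + path)
        lines.extend(render(children, path, depth + 1))
    return lines

doc = ET.fromstring("<a><b><c/><c/></b><b><d/></b></a>")
distinct_paths, path_tree = build_path_tree(doc)
print(distinct_paths)  # ['/a', '/a/b', '/a/b/c', '/a/b/d']
print(path_tree)       # {'a': {'b': {'c': {}, 'd': {}}}}
print("\n".join(render(path_tree)))
```

For a real web page the input would first have to be parsed with an HTML-aware parser, since most pages are not well-formed XML. The walk itself stays the same.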

The reason it complains about /html/body/ is that a legal XPath expression cannot contain a trailing /.