How to count preceding HTML tags with no root element (lxml, Python 3). EDIT: also get the elements

How can I count the number of details elements before the next h3?

I prepared a basic test snippet which looks like this:

import lxml
from lxml import html
test_html="""
<main>
   <h3 id="test1">
   <p>test1 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test2">
   <p>test2 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test3">
   <p>test3 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
</main>
"""

main_content=html.fromstring(test_html)

I could save the id, but I don't know how to use XPath to count only the details elements that belong to the h3 with that id.

#count number of details before each h3
for idx, h3 in enumerate(main_content.xpath("//h3")):
    #print h3 id
    print(h3.get("id"))
    #here is there any way with the xpath to count the preceding-sibling specifying the id?
    print(len(h3.xpath("preceding-sibling::details")))

This outputs:

test1
0
test2
4
test3
8

EDIT: Might have solved it with:

print(main_content.xpath(f"count(//h3[@id='{h3.get('id')}']/following-sibling::details)-count(//h3[@id='{h3.get('id')}']/following-sibling::h3/following-sibling::details)"))

Seems to be working!

Is there any way to get the elements instead of the counts? I know I could use not() in the XPath, but I don't know how to apply it. This is my attempt:

import lxml
from lxml import html
test_html="""
<main>
   <h3 id="test1">
   <p>test1 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test2">
   <p>test2 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test3">
   <p>test3 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
</main>
"""

main_content=html.fromstring(test_html)
#count number of details before each h3
for idx, h3 in enumerate(main_content.xpath("//h3")):
    #print h3 id
    print(h3.get("id"))
    print(main_content.xpath(f"count(//h3[@id='{h3.get('id')}']/following-sibling::details)-count(//h3[@id='{h3.get('id')}']/following-sibling::h3/following-sibling::details)"))
    #get the actual details
    list_of_details= main_content.xpath(f"//h3[@id='{h3.get('id')}']/following-sibling::details)[not(//h3[@id='{h3.get('id')}']/following-sibling::h3/following-sibling::details)]")

This returns an exception:

lxml.etree.XPathEvalError: Invalid expression

Thank you in advance!

CodePudding user response:

I would do it this way. To demonstrate, I modified test_html a bit:

test_html="""
<main>
   <h3 id="test1">
   <p>test1 desc</p>
   <details></details>
   <details></details>
   <h3 id="test2">
   <p>test2 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test3">
   <p>test3 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
</main>
"""

First, forwards:

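# Assumes main_content = html.fromstring(test_html) for the modified test_html above.
for de in main_content.xpath('//h3'):
    count = 0
    # Walk the siblings after this h3 in document order.
    for child in de.xpath('following-sibling::*'):
        if child.tag == "h3":
            break  # the next h3 ends this section
        if child.tag == "details":
            count += 1
    print(count)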

Output:

2
3
4
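
The question also asks for the elements themselves rather than only the counts. Below is a minimal sketch of the same forward walk that collects them. It assumes the h3 tags are closed explicitly (a change from the snippets above, made so the tree does not depend on how the parser auto-closes tags), and details_until_next_h3 is a hypothetical helper name, not something from lxml:

```python
from lxml import html

# Same shape as the modified test_html, but with each h3 closed explicitly
# (an assumption of this sketch, not part of the original snippet).
test_html = """
<main>
   <h3 id="test1"></h3>
   <p>test1 desc</p>
   <details></details>
   <details></details>
   <h3 id="test2"></h3>
   <p>test2 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test3"></h3>
   <p>test3 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
</main>
"""

main_content = html.fromstring(test_html)


def details_until_next_h3(h3):
    """Return the details siblings after h3, stopping at the next h3."""
    found = []
    for sibling in h3.itersiblings():  # following siblings, document order
        if sibling.tag == "h3":
            break
        if sibling.tag == "details":
            found.append(sibling)
    return found


# Map each h3 id to the list of details elements in its section.
groups = {h3.get("id"): details_until_next_h3(h3)
          for h3 in main_content.xpath("//h3")}

for h3_id, elements in groups.items():
    print(h3_id, len(elements))
```

Each value in groups is a list of lxml elements, so len() gives the count and the elements stay available for further processing.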

... and backwards:

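for de in main_content.xpath('//h3'):
    count = 0
    # preceding-sibling results come back in document order, so reverse them
    # to walk backwards from this h3.
    for child in reversed(de.xpath('preceding-sibling::*')):
        if child.tag == "h3":
            break  # the previous h3 starts the preceding section
        if child.tag == "details":
            count += 1
    print(count)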

Output:

0
2
3
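
If a single XPath 1.0 expression is preferred over a Python loop, one possible approach is to select the details siblings whose number of preceding h3 siblings equals the position of the current h3. This is only a sketch: it assumes every h3 and details element shares one parent and that the h3 tags are closed explicitly, and n is an XPath variable passed through lxml's keyword arguments. As far as I can tell, the "Invalid expression" error in the question comes from the unbalanced ")" after the first following-sibling::details.

```python
from lxml import html

# h3 tags are closed explicitly here (an assumption of this sketch).
test_html = """
<main>
   <h3 id="test1"></h3>
   <p>test1 desc</p>
   <details></details>
   <details></details>
   <h3 id="test2"></h3>
   <p>test2 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <h3 id="test3"></h3>
   <p>test3 desc</p>
   <details></details>
   <details></details>
   <details></details>
   <details></details>
</main>
"""

main_content = html.fromstring(test_html)

sections = {}
for h3 in main_content.xpath("//h3"):
    # 1-based position of this h3 among its h3 siblings.
    n = int(h3.xpath("count(preceding-sibling::h3)")) + 1
    # Keep only the details that have exactly n h3 siblings before them,
    # i.e. those between this h3 and the next one.
    sections[h3.get("id")] = h3.xpath(
        "following-sibling::details[count(preceding-sibling::h3) = $n]", n=n
    )

for h3_id, elements in sections.items():
    print(h3_id, len(elements))
```

Passing n as a variable avoids building the expression with an f-string, which also sidesteps quoting problems if an id ever contains a quote character.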

And, finally, a note: things would have been easier with a library that (unlike lxml) supports XPath versions later than 1.0.