How can I count the number of details elements before the next h3?
I prepared a basic test snippet which looks like this:
import lxml
from lxml import html
test_html="""
<main>
<h3 id="test1">
<p>test1 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
<h3 id="test2">
<p>test2 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
<h3 id="test3">
<p>test3 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
</main>
"""
main_content=html.fromstring(test_html)
I could save the id, but I don't know how to write an XPath that counts only the details belonging to the h3 with that id.
# count number of details before each h3
for idx, h3 in enumerate(main_content.xpath("//h3")):
    # print h3 id
    print(h3.get("id"))
    # here is there any way with the xpath to count the preceding-sibling specifying the id?
    print(len(h3.xpath("preceding-sibling::details")))
This outputs:
test1
0
test2
4
test3
8
EDIT: Might have solved it with:
print(main_content.xpath(f"count(//h3[@id='{h3.get('id')}']/following-sibling::details)-count(//h3[@id='{h3.get('id')}']/following-sibling::h3/following-sibling::details)"))
Seems to be working!
Is there any way to get the elements instead of the counts? I know I could use not() in the XPath, but I don't know how to put it in place. This is my attempt:
import lxml
from lxml import html
test_html="""
<main>
<h3 id="test1">
<p>test1 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
<h3 id="test2">
<p>test2 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
<h3 id="test3">
<p>test3 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
</main>
"""
main_content=html.fromstring(test_html)
# count number of details before each h3
for idx, h3 in enumerate(main_content.xpath("//h3")):
    # print h3 id
    print(h3.get("id"))
    print(main_content.xpath(f"count(//h3[@id='{h3.get('id')}']/following-sibling::details)-count(//h3[@id='{h3.get('id')}']/following-sibling::h3/following-sibling::details)"))
    # get the actual details
    list_of_details= main_content.xpath(f"//h3[@id='{h3.get('id')}']/following-sibling::details)[not(//h3[@id='{h3.get('id')}']/following-sibling::h3/following-sibling::details)]")
This returns an exception:
lxml.etree.XPathEvalError: Invalid expression
Thank you in advance!
Answer:
I would do it this way. To demonstrate, I modified test_html a bit:
test_html="""
<main>
<h3 id="test1">
<p>test1 desc</p>
<details></details>
<details></details>
<h3 id="test2">
<p>test2 desc</p>
<details></details>
<details></details>
<details></details>
<h3 id="test3">
<p>test3 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
</main>
"""
First, forwards:
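main_content = html.fromstring(test_html)  # re-parse the modified HTML
for de in main_content.xpath('//h3'):
    count = 0
    # walk the following siblings in document order
    for child in de.xpath('following-sibling::*'):
        if child.tag == "h3":
            # the next h3 starts a new section, so stop here
            break
        elif child.tag == "details":
            count += 1
    print(count)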
Output:
2
3
4
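The same forward walk can also return the elements themselves rather than a count, which is what the question asks for at the end. The following is only a minimal sketch: `details_until_next_h3` and `groups` are names made up for illustration, and the h3 tags are closed explicitly here (unlike in the question) so that the result does not depend on how the parser recovers from unclosed tags.

```python
from itertools import takewhile

from lxml import html

# Modified markup from this answer; the h3 tags are closed explicitly
# (an assumption made for this sketch, the question leaves them open).
test_html = """
<main>
<h3 id="test1"></h3>
<p>test1 desc</p>
<details></details>
<details></details>
<h3 id="test2"></h3>
<p>test2 desc</p>
<details></details>
<details></details>
<details></details>
<h3 id="test3"></h3>
<p>test3 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
</main>
"""
main_content = html.fromstring(test_html)


def details_until_next_h3(h3):
    """Hypothetical helper: details siblings after h3, up to the next h3."""
    # itersiblings() yields following siblings in document order;
    # takewhile stops at the first sibling that is another h3.
    section = takewhile(lambda el: el.tag != "h3", h3.itersiblings())
    return [el for el in section if el.tag == "details"]


# Map each h3 id to the list of details elements in its section.
groups = {h3.get("id"): details_until_next_h3(h3)
          for h3 in main_content.xpath("//h3")}

for h3_id, found in groups.items():
    print(h3_id, len(found))
```

Each value in `groups` is a list of lxml elements, so `len()` of a list gives the same counts as the loop above.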
... and backwards:
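for de in main_content.xpath('//h3'):
    count = 0
    # preceding-sibling results come back in document order, so reverse them
    # to start from the sibling closest to this h3
    for child in reversed(de.xpath('preceding-sibling::*')):
        if child.tag == "h3":
            # reached the previous h3, so stop here
            break
        elif child.tag == "details":
            count += 1
    print(count)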
Output:
0
2
3
And, finally, a note: things would have been easier using a library that (unlike lxml) supports XPath versions later than 1.0.
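That said, XPath 1.0 may be enough to select the elements directly. The expression in the question fails with "Invalid expression" because it contains a closing parenthesis with no matching opening one. One possible alternative, sketched below, keeps only the details whose nearest preceding h3 sibling has the wanted id. The names `expr`, `hid` and `by_id` are made up for this example, the id is passed as an XPath variable through lxml's keyword arguments, and the h3 tags are closed explicitly, which is an assumption that differs from the question's markup.

```python
from lxml import html

# Modified markup from this answer; the h3 tags are closed explicitly
# (an assumption made for this sketch, the question leaves them open).
test_html = """
<main>
<h3 id="test1"></h3>
<p>test1 desc</p>
<details></details>
<details></details>
<h3 id="test2"></h3>
<p>test2 desc</p>
<details></details>
<details></details>
<details></details>
<h3 id="test3"></h3>
<p>test3 desc</p>
<details></details>
<details></details>
<details></details>
<details></details>
</main>
"""
main_content = html.fromstring(test_html)

# preceding-sibling is a reverse axis, so h3[1] is the closest h3 before
# each details element. $hid is an XPath variable that lxml binds from
# the keyword argument of the same name.
expr = "//details[preceding-sibling::h3[1]/@id = $hid]"

by_id = {}
for h3 in main_content.xpath("//h3"):
    by_id[h3.get("id")] = main_content.xpath(expr, hid=h3.get("id"))

for h3_id, found in by_id.items():
    print(h3_id, len(found))
```

Passing the id as a variable avoids building the expression with an f-string, so quotes inside an id cannot break the XPath.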