I'm using XPATH to extract information from a website which generates data of the following structure:
<span >
<span >aaa:</span> <a href="bbb"><strong>ccc</strong></a><br>
<span >ddd:</span> eee<br>
<span >fff:</span> <b>ggg gg </b><br>
...
<span >hhh:</span>
<a href="iii">jjj</a>,
...
<a href="kkk">lll</a><br>
<br>
</span>
<span >mmm <b>nnn</b> ...
<br><br>
</span>
<span >
<span >ooo:</span> ppp<br>
<span >qqq:</span> rrr<br>
...
</span>
A few things to note first:
- the exact number of <span > tags varies
- the number of <a> tags after <span >hhh:</span> varies
To extract what follows the individual classA1 spans, I use these XPath expressions:
//span[contains(text(),'aaa:')]//following::text()[1]
//span[contains(text(),'ddd:')]//following::text()[1]
//span[contains(text(),'fff:')]//following::text()[1]
...
And so on.
When I try to extract what follows <span >hhh:</span>, that is, either the plain texts "jjj" and "lll" or the whole HTML part (i.e. "<a href="iii">jjj</a>,...<a href="kkk">lll</a>"), I keep running into problems.
As mentioned above, the number of <a> tags there is unpredictable, so I cannot simply select them by index. And if I use the following expression, I also get everything that comes afterwards, including the text of the following classB span, which I don't want.
//span[contains(text(),'hhh:')]//following::text()
Can you please suggest an XPath solution?
Many thanks!
CodePudding user response:
If I understand correctly what you are asking for, this should give you all the a elements that come after the <span >hhh:</span> element and share its parent:
//span[@class='classA1' and text()='hhh:']/following-sibling::a
Now you can iterate over the list of resulting a elements and extract their texts.
Alternatively, you can get their texts directly with this:
//span[@class='classA1' and text()='hhh:']/following-sibling::a/text()
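To make the sibling logic concrete, here is a minimal Python sketch that walks the siblings by hand, since the standard library's ElementTree supports only a subset of XPath without the following-sibling axis. Everything in it is an assumption for illustration: the sample fragment is a well-formed stand-in (with <br/> instead of <br>), the class name is the one used in the expressions above, and links_after_label is a hypothetical helper. Unlike the plain following-sibling::a expression, it also stops at the first <br/>, which may be useful if later labels in the same parent have links of their own.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page fragment (note <br/>, not <br>).
# The class name "classA1" and the wrapping <div> are assumptions.
SAMPLE = """
<div>
  <span>
    <span class="classA1">aaa:</span> <a href="bbb"><strong>ccc</strong></a><br/>
    <span class="classA1">ddd:</span> eee<br/>
    <span class="classA1">hhh:</span>
    <a href="iii">jjj</a>,
    <a href="kkk">lll</a><br/>
    <br/>
  </span>
  <span>mmm <b>nnn</b><br/><br/></span>
</div>
"""


def links_after_label(root, label):
    """Return the <a> siblings that follow the label span, up to the first <br/>."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "span" and (child.text or "").strip() == label:
                links = []
                for sibling in children[i + 1:]:
                    if sibling.tag == "br":
                        break  # treat the line break as the end of this label's value
                    if sibling.tag == "a":
                        links.append(sibling)
                return links
    return []  # label not found


root = ET.fromstring(SAMPLE)
links = links_after_label(root, "hhh:")
texts = ["".join(a.itertext()) for a in links]  # text of each link, nested tags included
hrefs = [a.get("href") for a in links]
print(texts, hrefs)
```

With a full XPath 1.0 engine you would not need the loop and could evaluate the expression directly; the sketch only mirrors what the following-sibling axis selects.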
CodePudding user response:
Since the indentation of your source HTML does not reflect the parent/child relationships, the structure is not entirely clear, but maybe this helps:
//span[contains(.,'mmm')]/preceding::span[contains(.,'hhh:')][1]/following-sibling::a[not(span[contains(.,'mmm')])]