Using XPATH to get text AFTER one closed tag and BEFORE the beginning of another specific tag?

I'm using XPATH to extract information from a website which generates data of the following structure:

<span >
    <span >aaa:</span> <a href="bbb"><strong>ccc</strong></a><br>
    <span >ddd:</span> eee<br>
    <span >fff:</span> <b>ggg gg </b><br>
    ...
    <span >hhh:</span>
        <a href="iii">jjj</a>,
        ...
        <a href="kkk">lll</a><br>
        <br>
</span>
<span >mmm <b>nnn</b> ...
        <br><br>
</span>
<span >
    <span >ooo:</span> ppp<br>
    <span >qqq:</span> rrr<br>
    ...
</span>

A few things to note first:

the exact number of  tags varies
the number of <a> tags after hhh: varies

To extract what follows the individual classA1 spans, I use this XPATH definition:

//span[contains(text(),'aaa:')]//following::text()[1]
//span[contains(text(),'ddd:')]//following::text()[1]

//span[contains(text(),'fff:')]//following::text()[1]
...

And so on.

Trying to extract the text after hhh:, that is, either the plain text "jjj" and "lll" or the whole html part (i.e. "<a href="iii">jjj</a>,...<a href="kkk">lll</a>"), I keep running into problems.

Since, as I mentioned above, the number of tags there varies and is unpredictable, I cannot simply identify them by index number. And if I use the following, I also get everything that comes afterwards, including the classB span, which I definitely don't need or want.

//span[contains(text(),'hhh:')]//following::text()

Can you, please, suggest an XPATH solution?

Many thanks!

CodePudding user response：

If I understand correctly what you are asking for, this should give you all the a elements that come after the hhh: element:

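//span[@class='classA1' and text()='hhh:']/following-sibling::a

Because following-sibling only selects nodes that share the same parent, the result stops at the end of the enclosing span and never reaches the classB span.
Now you can iterate over the list of resulting a elements and extract their texts.
Alternatively, you can get their texts directly with this:

//span[@class='classA1' and text()='hhh:']/following-sibling::a/text()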
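
To illustrate the "collect the links after the label and stop at the next tag" logic outside of an XPath engine, here is a minimal Python sketch. It is not the answerer's code: it uses the standard library's xml.etree.ElementTree, whose XPath subset has no following-sibling axis, so the sibling walk is done by hand. The sample markup, the class names, the self-closed <br/> tags, and the helper name links_after_label are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's markup. The class names and the
# self-closed <br/> tags are assumptions made for this sketch.
SAMPLE = """
<div>
  <span class="classA">
    <span class="classA1">aaa:</span> <a href="bbb"><strong>ccc</strong></a><br/>
    <span class="classA1">ddd:</span> eee<br/>
    <span class="classA1">hhh:</span>
      <a href="iii">jjj</a>,
      <a href="kkk">lll</a><br/>
      <br/>
  </span>
  <span class="classB">mmm <b>nnn</b><br/><br/></span>
</div>
"""


def links_after_label(root, label):
    """Return the text of each <a> that directly follows the label span.

    The walk stays among the label's siblings and stops at the first
    sibling that is not an <a> (for example a <br> or the next label).
    """
    # ElementTree elements have no parent pointer, so look at every
    # element as a potential parent and scan its direct children.
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == "span" and (child.text or "").strip() == label:
                texts = []
                for sibling in children[i + 1:]:
                    if sibling.tag != "a":
                        break  # first non-link sibling ends the run
                    # itertext() also picks up text nested in <strong> etc.
                    texts.append("".join(sibling.itertext()).strip())
                return texts
    return []


root = ET.fromstring(SAMPLE)
print(links_after_label(root, "hhh:"))  # ['jjj', 'lll']
```

Since the walk never leaves the label's parent, the classB span is not touched. In an engine with full XPath 1.0 support, a comparable bound could probably be expressed by adding a predicate such as [preceding-sibling::span[1][text()='hhh:']] to the a step, so that links belonging to a later label are excluded.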

CodePudding user response：

Since the indentation in your source HTML does not match the parent/child relationships, the structure is not entirely clear, but maybe this helps:

//span[contains(.,'mmm')]/preceding::span[contains(.,'hhh:')][1]/following-sibling::a[not(span[contains(.,'mmm')])]