Using XPATH to get text AFTER one closed tag and BEFORE the beginning of another specific tag?

Time:02-20

I'm using XPATH to extract information from a website which generates data of the following structure:

<span >
    <span >aaa:</span> <a href="bbb"><strong>ccc</strong></a><br>
    <span >ddd:</span> eee<br>
    <span >fff:</span> <b>ggg gg </b><br>
    ...
    <span >hhh:</span>
        <a href="iii">jjj</a>,
        ...
        <a href="kkk">lll</a><br>
        <br>
</span>
<span >mmm <b>nnn</b> ...
        <br><br>
</span>
<span >
    <span >ooo:</span> ppp<br>
    <span >qqq:</span> rrr<br>
    ...
</span>

A few things to note first:

  1. the exact number of <span > tags varies
  2. the number of <a> tags after <span >hhh:</span> varies

To extract what follows the individual classA1 spans, I use these XPath expressions:

//span[contains(text(),'aaa:')]//following::text()[1]
//span[contains(text(),'ddd:')]//following::text()[1]

//span[contains(text(),'fff:')]//following::text()[1]
...

And so on.

I keep running into problems when trying to extract what comes after <span >hhh:</span>, that is, either the plain text "jjj" and "lll" or the whole HTML part (i.e. "<a href="iii">jjj</a>,...<a href="kkk">lll</a>").

Since, as I mention above, the number of tags there may vary greatly and is unpredictable, I cannot simply identify them by index number. And if I use the following, I also get everything that follows including the following classB span, which I definitely don't need or want.

//span[contains(text(),'hhh:')]//following::text()

Can you, please, suggest an XPATH solution?

Many thanks!

CodePudding user response:

If I understand correctly what you are asking for, this should give you all the a elements coming after the <span >hhh:</span> element (note that the span's text includes the trailing colon, so an exact match has to include it as well):

//span[@class='classA1' and text()='hhh:']/following-sibling::a

Now you can iterate over the list of resulting a elements and extract their texts.
Alternatively, you can get their texts directly with this:

//span[@class='classA1' and text()='hhh:']/following-sibling::a/text()
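
To make the sibling-based idea concrete outside an XPath engine, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not the asker's actual setup: the markup is a shortened, well-formed variant of the sample (with `<br/>` instead of `<br>`), the `classA1` attribute is taken from the question's description, and `links_after_label` is a hypothetical helper name. Real-world HTML would first need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the markup in the question.
# Assumption: label spans carry class="classA1", as the question describes.
SAMPLE = """
<span>
    <span class="classA1">fff:</span> <b>ggg gg </b><br/>
    <span class="classA1">hhh:</span>
        <a href="iii">jjj</a>,
        <a href="kkk">lll</a><br/>
        <br/>
</span>
"""

def links_after_label(container, label):
    """Collect the text of every <a> that follows the label span,
    stopping at the next <span> sibling (hypothetical helper)."""
    collected = []
    seen_label = False
    for child in container:
        if child.tag == "span":
            if seen_label:
                break  # the next label span ends the run of links
            seen_label = (child.text or "").strip() == label
        elif seen_label and child.tag == "a":
            collected.append(child.text)
    return collected

root = ET.fromstring(SAMPLE)
print(links_after_label(root, "hhh:"))  # ['jjj', 'lll']
```

Unlike a bare following-sibling::a selection, this sketch stops at the next label span, so links belonging to a later label in the same container would not be picked up.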

CodePudding user response:

Since the indentation in your source HTML does not match the parent/child relationships, the structure is not entirely clear, but maybe this helps:

//span[contains(.,'mmm')]/preceding::span[contains(.,'hhh:')][1]/following-sibling::a[not(span[contains(.,'mmm')])]