LXML/Requests-HTML: Get element after plain text, both children of same element?

Time:10-09

I'm new to web scraping, and I'm currently using requests, requests-html, and lxml.

I'm having trouble figuring out how to target the "DESIRED INFO" from the specific <span> element in the following circumstance (a lot of info is thrown into a single <td>):

Note: no attributes differentiate the span elements, so I need to go by the plain text within the <td> that occurs right before the span.

Note 2: no wrapper elements (e.g. <p>, <b>) surround that plain text (OTHER TEXT, CONSISTENT TEXT:, etc.). It is plain HTML text sitting directly inside the <td>, as immediate "children" of it.

<td>
  OTHER TEXT
  <span>...</span>
  ...
  ...
  OTHER TEXT 2
  <span>...</span>
  ...
  ...
  CONSISTENT TEXT:
  <span>DESIRED INFO</span>
  ...
  ...
  OTHER TEXT 3
  <span>...</span>
  ...
  ...
  OTHER TEXT 4
  <span>...</span>
  ...
  ...
</td>

My current approach is to search for every value that could appear in the DESIRED INFO spot and grab the <span> elements that match. That is insufficient, because the <span> elements that follow the OTHER TEXT labels can contain the same values.

What I'm currently doing (insufficient):

  spanDesiredInfoList = []
  for el in tree.xpath('//span[text()="POSSIBILITY 1"]'):
    spanDesiredInfoList.append(el)
  for el in tree.xpath('//span[text()="POSSIBILITY 2"]'):
    spanDesiredInfoList.append(el)
  ...
  ...
  # attempt to handle final list and get the correct span (basically impossible)

Thank you for your help!

CodePudding user response:

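In this markup each <span> is preceded by exactly one text node of the <td>, so the list of the <td>'s text nodes and the list of <span> texts line up index for index. Find the index of the text node that contains the known substring, and the <span> text at the same index is the one you want: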

from lxml import etree

text = '''
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
        <table>
            <tr>
                <td>
                    OTHER TEXT
                    OTHER TEXT
                    <span>uuu</span>
                    OTHER TEXT 2
                    OTHER TEXT 2
                    <span>ttt</span>
                    OTHER TEXT
                    CONSISTENT TEXT:
                    OTHER TEXT
                    <span>DESIRED INFO</span>
                    OTHER TEXT 3
                    <span>uuu</span>
                    OTHER TEXT 4
                    <span>ttt</span>
                </td>
            </tr>
        </table>
    </body>
</html>
'''


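html = etree.HTML(text)
# Text nodes that are direct children of the <td>, in document order.
result_td = html.xpath('//td/text()')
# Text of each <span>, in document order.
result = html.xpath('//span/text()')
# Find the text node that contains the known label; the <span> text
# at the same index is the one that follows it.
for i, el in enumerate(result_td):
    if 'CONSISTENT TEXT:' in el:
        print(result[i])

Output:

DESIRED INFO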
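
The index pairing relies on every <span> being preceded by exactly one text node of the <td>. As a possible alternative (not part of the answer's original code), XPath's following-sibling axis can pick the first <span> after the matching text node directly, without comparing two lists. This is a minimal sketch on a trimmed-down copy of the markup; the names snippet, root, matches and desired are made up for illustration:

```python
from lxml import etree

# Trimmed-down stand-in for the markup in the question (an assumption).
snippet = '''
<table><tr><td>
  OTHER TEXT
  <span>uuu</span>
  OTHER TEXT 2
  <span>ttt</span>
  CONSISTENT TEXT:
  <span>DESIRED INFO</span>
  OTHER TEXT 3
  <span>uuu</span>
</td></tr></table>
'''

root = etree.HTML(snippet)

# Select the <td> text node that contains the label, then step to the
# first <span> sibling that comes after it.
matches = root.xpath(
    '//td/text()[contains(., "CONSISTENT TEXT:")]/following-sibling::span[1]'
)

# xpath() returns a list; it is empty when the label is not found.
desired = matches[0].text if matches else None
print(desired)
```

Because the selection starts from the label itself, other <span> elements with identical contents elsewhere in the <td> do not interfere.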

CodePudding user response:

Thank you @Сергей Кох for the logic I needed to solve this problem. That was the correct answer to my question, and I marked it as such.

I simplified the situation slightly for the question, since I was after the logic rather than an exact implementation. Here is the exact implementation I ended up using, in case it's useful to anyone.

Notes:

  • There was a <div> child directly within the <td> that contained everything (an oversight on my part)
  • The <span> tags were actually <b> tags
  • Most of the <b> tags contained their text directly, but some (including the one I was interested in) had a <span> child that contained the desired text
  • The indices of the two lists were not quite identical in reality. Because FIRST OTHER TEXT is also consistent, I could find the index in result_td at which the <b> elements in result start, store it in indexForReset, and subtract it so that the indices line up 1:1 again

from lxml import etree

text = '''
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
        <table>
            <tr>
                <td>
                <div>
                    OTHER TEXT STUFF
                    OTHER TEXT STUFF
                    OTHER TEXT STUFF
                    ...
                    FIRST OTHER TEXT
                    <b>uuu</b>
                    SECOND OTHER TEXT
                    <b>ttt</b>
                    THIRD OTHER TEXT
                    <b><span>ttt</span></b>
                    CONSISTENT TEXT:
                    <b><span>DESIRED INFO</span></b>
                    FOURTH OTHER TEXT
                    <b>sss</b>
                    FIFTH OTHER TEXT
                    <b>rrr</b>
                    ...
                    OTHER TEXT STUFF
                    OTHER TEXT STUFF
                    OTHER TEXT STUFF
                </div>
                </td>
            </tr>
        </table>
    </body>
</html>
'''

html = etree.HTML(text)
# Text nodes that are direct children of the <div>, in document order.
result_td = html.xpath('//td/div/text()')
# The <b> elements themselves (not their text), in document order.
result = html.xpath('//td/div/b')
# Index of the text node where the <b> elements start lining up.
indexForReset = None
for i, el in enumerate(result_td):
    # Remember only the first text node that contains the anchor text.
    if indexForReset is None and 'FIRST OTHER TEXT' in el:
        indexForReset = i
    if indexForReset is not None and 'CONSISTENT TEXT:' in el:
        # Get all text content of the element subtree (within <span> in this case).
        print(''.join(result[i - indexForReset].itertext()))
        break

Output:

DESIRED INFO
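
For anyone who would rather avoid the index bookkeeping: lxml stores the text before an element's first child in the parent's .text, and the text that follows each child in that child's .tail. Walking the children once and tracking the text that came just before each one gives the same result. This is only a sketch on a shortened copy of the markup; text_after_label, snippet, div and found are hypothetical names, not part of the code I actually used:

```python
from lxml import etree

# Shortened stand-in for the real markup (an assumption).
snippet = '''
<table><tr><td><div>
  OTHER TEXT STUFF
  FIRST OTHER TEXT
  <b>uuu</b>
  THIRD OTHER TEXT
  <b><span>ttt</span></b>
  CONSISTENT TEXT:
  <b><span>DESIRED INFO</span></b>
  FOURTH OTHER TEXT
  <b>sss</b>
</div></td></tr></table>
'''


def text_after_label(parent, label, tag="b"):
    """Return the text of the first <tag> child whose immediately
    preceding text contains label, or None if there is no such child."""
    # Text before the first child lives on the parent itself.
    preceding = parent.text or ""
    for child in parent:
        if child.tag == tag and label in preceding:
            # itertext() also collects text nested in descendants such as <span>.
            return "".join(child.itertext()).strip()
        # Text between this child and the next one is stored as its tail.
        preceding = child.tail or ""
    return None


div = etree.HTML(snippet).xpath("//td/div")[0]
found = text_after_label(div, "CONSISTENT TEXT:")
print(found)
```

This only looks at the text directly in front of each child, so it would miss a label that is separated from its <b> by another element.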

Thank you again @Сергей Кох for getting me here.
