Finding sibling items between headers

Time:05-04

I'm trying to scrape some documentation for data files composed in XML. I had been writing the XSD manually by reading the pages and typing it out, but it has occurred to me that this is a prime case for page scraping. The general format (as far as I can tell based on a random sample) is something like the following:

   <h2>
    <span  id="The_.22show.22_Child_Element">
     The "show" Child Element
    </span>
   </h2>
   <dl>
    <dd>
     <table >
      <tr>
       <td >
        allowstack
       </td>
       <td>
        (Optional) Boolean – Indicates whether the user is allowed to stack items within the table, subject to the restrictions imposed for each item. Default: "yes".
       </td>
      </tr>
      <tr>
       <td>
        agentlist
       </td>
       <td>
        (Optional) Id – If set to a tag group id, only picks with the agent pick's identity tag from that group are shown in the table. Default: None.
       </td>
      </tr>
      <tr>
       <td>
        allowmove
       </td>
       <td>
        (Optional) Boolean – Indicates whether picks in this table can be moved out of this table, if the user drags them around. Default: "yes".
       </td>
      </tr>
      <tr>
       <td>
        listpick
       </td>
       <td>
        (Optional) Id – Unique id of the pick to take the table's list expression from (see listfield, below). Note that this does not work when used with portals. Default: None.
       </td>
      </tr>
      <tr>
       <td>
        listfield
       </td>
       <td>
        (Optional) Id – Unique id of the field to take the table's list expression from (see listpick, above). Note that this does not work when used with portals. Default: None.
       </td>
      </tr>
     </table>
    </dd>
   </dl>
   <p>
    The "show" element also possesses child elements that define additional behaviors of the table. The list of these child elements is below and must appear in the order shown. Click on the link to access the details for each element.
   </p>
   <dl>
    <dd>
     <table >
      <tr>
       <td >
        <a href="index.php5@title=TableDef_Element_(Data).html#list">
         list
        </a>
       </td>
       <td>
        An optional "list" element may appear as defined by the given link. This element defines a
        <a href="index.php5@title=List_Tag_Expression.html" title="List Tag Expression">
         List Tag Expression
        </a>
        for the table.
       </td>
      </tr>
     </table>
    </dd>
   </dl>

There's a pretty clear pattern: each file defines a number of elements, each introduced by a header followed by text and a table (generally the attributes), and possibly another block of text and a table (for the child elements). I think I can reach a reasonable solution by using next or next-sibling to step through the items and scanning the text to determine whether the following table holds attributes or child elements, but it feels a bit weird that I can't just grab everything between two header tags and then scan that.

CodePudding user response:

You can search for multiple elements at the same time, for example <h2> and <table>. Because the results come back in document order, you can make a note of each <h2>'s contents and then apply it to every <table> that follows until the next <h2>.

For example:

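from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")

h2 = h2_id = None  # most recent header seen so far

# find_all returns matches in document order, so each table follows its header
for el in soup.find_all(['h2', 'table']):
    if el.name == 'h2':
        h2 = el.get_text(strip=True)
        # the id lives on the <span> inside the header; tolerate headers without one
        h2_id = el.span.get('id') if el.span else None
    else:
        for tr in el.find_all('tr'):
            row = [td.get_text(strip=True) for td in tr.find_all('td')]
            print([h2, h2_id, *row])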
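
If you do want to grab everything between two headers as a single group, something like the following might work. It is only a sketch, and it assumes each <h2> and the <dl>/<p> blocks after it are siblings under the same parent, as in the sample from the question. The function name sections_between_headers and the trimmed-down SAMPLE_HTML are made up for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the markup in the question (hypothetical sample data).
SAMPLE_HTML = """
<h2><span id="The_.22show.22_Child_Element">The "show" Child Element</span></h2>
<dl><dd><table>
<tr><td>allowstack</td><td>(Optional) Boolean</td></tr>
<tr><td>agentlist</td><td>(Optional) Id</td></tr>
</table></dd></dl>
<p>The "show" element also possesses child elements.</p>
<dl><dd><table>
<tr><td><a href="#list">list</a></td><td>An optional "list" element.</td></tr>
</table></dd></dl>
<h2><span id="Next">Next Element</span></h2>
<p>Unrelated section.</p>
"""


def sections_between_headers(soup, header="h2"):
    """Map each header's text to the sibling tags that follow it, up to the next header."""
    sections = {}
    for h in soup.find_all(header):
        block = []
        for sib in h.next_siblings:
            if not isinstance(sib, Tag):
                continue  # skip whitespace and other bare strings
            if sib.name == header:
                break  # reached the next section
            block.append(sib)
        sections[h.get_text(strip=True)] = block
    return sections


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sections = sections_between_headers(soup)

for title, block in sections.items():
    # Each block can now be scanned as a unit, e.g. to pull out its tables.
    tables = [t for el in block for t in el.find_all("table")]
    print(title, [el.name for el in block], len(tables))
```

Once a section is collected this way, the <p> that precedes a table is available in the same list, so its text can be checked to decide whether that table describes attributes or child elements. If the real pages wrap the headers in an extra container, the sibling walk would have to start from that container instead.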