How to use CSS selectors to select everything but the table element and its contents

As the title says, I'm trying to use CSS selectors to select all elements in div.page except for the contained table. I'm trying to use the :not() pseudo-class, but :not(table) doesn't seem to work as described.

from requests_html import HTML

html_string="""<html>
<head>
  <title>Testing</title>
</head>
<body>
<div class="page group">
  <h2 class="level2">This is a heading</h2>
  <p>This is a paragraph.</p>

  <div class="table">
    <table border=1>
      <tr class="row1">
        <th>th 1</th>
        <td>td 1</td>
      </tr>
      <tr class="row2">
        <th>th 2</th>
        <td>td 2</td>
      </tr>
    </table>
  </div>

  <h3 class="level3">a sub heading</h3>
  <p>This is also a paragraph.</p>
  <p>This is another paragraph.</p>

  <div>This is some text in a div element.</div>
  <a href="https://www.blah.com" target="_blank">Blah!</a>
</div>
</body>"""

page = HTML(html=html_string)
page.find(':not(table), :not(table) *')

This returns the following list, which clearly contains the table element and the elements inside it, along with their text. I'd like to exclude those.

[<Element 'html' >,
 <Element 'head' >,
 <Element 'title' >,
 <Element 'body' >,
 <Element 'div' class=('page', 'group')>,
 <Element 'h2' class=('level2',)>,
 <Element 'p' >,
 <Element 'div' class=('table',)>,
 <Element 'table' border='1'>,
 <Element 'tr' class=('row1',)>,
 <Element 'th' >,
 <Element 'td' >,
 <Element 'tr' class=('row2',)>,
 <Element 'th' >,
 <Element 'td' >,
 <Element 'h3' class=('level3',)>,
 <Element 'p' >,
 <Element 'p' >,
 <Element 'div' >,
 <Element 'a' href='https://www.blah.com' target='_blank'>]

If it's not possible with CSS selectors I'd be willing to accept an XPath solution.

Answer:

It's not entirely clear which elements you do want to select. You say you want the elements "in" div.page, but that could mean either elements which are direct children of that div, or elements which are descendants at a deeper level. I'm not clear as to whether you want to include the div element with class='table' which itself contains the table.

Anyway, I'm offering a few XPath expressions which might suit your purpose.

Here's an XPath that returns all the elements which are descendants of the "page" div, at any level, and which are neither a table themselves nor have a table as an ancestor element.

/html/body/div[contains(@class, 'page')]
   //*
   [not(ancestor-or-self::table)]

It produces this result:

<h2 class="level2">This is a heading</h2>
<p>This is a paragraph.</p>
<div class="table">
   <table border="1">
      <tr class="row1">
         <th>th 1</th>
         <td>td 1</td>
      </tr>
      <tr class="row2">
         <th>th 2</th>
         <td>td 2</td>
      </tr>
   </table>
</div>
<h3 class="level3">a sub heading</h3>
<p>This is also a paragraph.</p>
<p>This is another paragraph.</p>
<div>This is some text in a div element.</div>
<a href="https://www.blah.com" target="_blank">Blah!</a>

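NB: it returns the div which contains the table, which is why the table markup shows up inside that div in the output above, but the table, its rows and its cells are not themselves among the returned elements.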
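If you want to try this from Python, here is a minimal sketch using lxml directly (requests_html is built on top of it). The shortened sample markup and the variable names are my own assumptions for illustration, not code from the question.

```python
from lxml import html

# Shortened version of the sample document (assumed for illustration).
html_string = """<html>
<body>
<div class="page group">
  <h2 class="level2">This is a heading</h2>
  <p>This is a paragraph.</p>
  <div class="table">
    <table border="1">
      <tr class="row1"><th>th 1</th><td>td 1</td></tr>
    </table>
  </div>
  <h3 class="level3">a sub heading</h3>
  <a href="https://www.blah.com" target="_blank">Blah!</a>
</div>
</body>
</html>"""

tree = html.fromstring(html_string)

# Descendants of div.page that are neither a table nor inside one.
expr = (
    "/html/body/div[contains(@class, 'page')]"
    "//*[not(ancestor-or-self::table)]"
)
elements = tree.xpath(expr)
tags = [el.tag for el in elements]
print(tags)  # ['h2', 'p', 'div', 'h3', 'a']

# The wrapper div is returned, so serializing it still shows the table.
wrapper_markup = html.tostring(elements[2], encoding="unicode")
print("<table" in wrapper_markup)  # True
```

The table, tr, th and td elements never appear in the list itself; the table markup only shows up when the wrapper div is serialized.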

Here's another one which returns the same set of elements, minus any which contain a table as a descendant. That excludes div.table.

/html/body/div[contains(@class, 'page')]
   //*
   [not(ancestor-or-self::table)]
   [not(descendant-or-self::table)]

Producing this result:

<h2 class="level2">This is a heading</h2>
<p>This is a paragraph.</p>
<h3 class="level3">a sub heading</h3>
<p>This is also a paragraph.</p>
<p>This is another paragraph.</p>
<div>This is some text in a div element.</div>
<a href="https://www.blah.com" target="_blank">Blah!</a>
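Assuming the end goal is the text outside the table (a guess on my part), a rough sketch with lxml could apply both predicates and then read each element's text. The shortened markup below is made up for illustration.

```python
from lxml import html

# Shortened sample markup (an assumption for illustration).
doc = html.fromstring("""<html><body>
<div class="page group">
  <h2 class="level2">This is a heading</h2>
  <div class="table">
    <table border="1"><tr class="row1"><th>th 1</th><td>td 1</td></tr></table>
  </div>
  <p>This is also a paragraph.</p>
  <div>This is some text in a div element.</div>
</div>
</body></html>""")

expr = (
    "/html/body/div[contains(@class, 'page')]"
    "//*"
    "[not(ancestor-or-self::table)]"    # not a table, and not inside one
    "[not(descendant-or-self::table)]"  # does not contain a table either
)

# text_content() gathers the text of an element and its descendants.
texts = [el.text_content().strip() for el in doc.xpath(expr)]
print(texts)
# ['This is a heading', 'This is also a paragraph.',
#  'This is some text in a div element.']
```

Because div.table is filtered out by the second predicate, none of the "th 1" / "td 1" text leaks into the result.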

Here's one which returns the children of div.page, except those that either contain, or are contained by, a table.

/html/body/div[contains(@class, 'page')]
   /*
   [not(ancestor-or-self::table)]
   [not(descendant-or-self::table)]

When evaluated on the sample document you provided in your question, it produces the same result as the previous expression, but it would behave differently on pages with more deeply nested elements, e.g. div elements within div elements. Because it selects only the direct children of div.page, a nested element is returned once, as part of its parent, rather than a second time as a separate result.
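To make the difference concrete, here is a small sketch with lxml on a hypothetical page that has one extra level of nesting. The markup and the "wrapper" class are invented for this example.

```python
from lxml import html

# Hypothetical page with an extra level of nesting.
doc = html.fromstring("""<html><body>
<div class="page">
  <div class="wrapper"><p>nested paragraph</p></div>
  <div class="table"><table><tr><td>cell</td></tr></table></div>
</div>
</body></html>""")

base = "/html/body/div[contains(@class, 'page')]"
filters = "[not(ancestor-or-self::table)][not(descendant-or-self::table)]"

# '//*' walks every descendant; '/*' looks at direct children only.
descendants = [el.tag for el in doc.xpath(base + "//*" + filters)]
children = [el.tag for el in doc.xpath(base + "/*" + filters)]

print(descendants)  # ['div', 'p']  wrapper div plus the p nested inside it
print(children)     # ['div']       wrapper div only
```

With `//*` the nested paragraph is returned twice in effect: once inside the wrapper div and once as its own result. With `/*` it only comes back as part of the wrapper.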

I'm guessing, of course, because I don't know exactly the purpose of the selection you're making, but I suspect this last expression is most likely the best.