As the title says, I'm trying to use CSS selectors to select all elements in div.page except for the contained table. I'm trying to use the :not() pseudo-class, but :not(table) doesn't seem to behave as described.
from requests_html import HTML
html_string="""<html>
<head>
<title>Testing</title>
</head>
<body>
<div class="page group">
<h2 class="level2">This is a heading</h2>
<p>This is a paragraph.</p>
<div class="table">
<table border=1>
<tr class="row1">
<th>th 1</th>
<td>td 1</td>
</tr>
<tr class="row2">
<th>th 2</th>
<td>td 2</td>
</tr>
</table>
</div>
<h3 class="level3">a sub heading</h3>
<p>This is also a paragraph.</p>
<p>This is another paragraph.</p>
<div>This is some text in a div element.</div>
<a href="https://www.blah.com" target="_blank">Blah!</a>
</div>
</body>
</html>"""
page = HTML(html=html_string)
page.find(':not(table), :not(table) *')
This returns the following list, which clearly contains the table element and its descendants, along with their text. I'd like to exclude those.
[<Element 'html' >,
<Element 'head' >,
<Element 'title' >,
<Element 'body' >,
<Element 'div' class=('page', 'group')>,
<Element 'h2' class=('level2',)>,
<Element 'p' >,
<Element 'div' class=('table',)>,
<Element 'table' border='1'>,
<Element 'tr' class=('row1',)>,
<Element 'th' >,
<Element 'td' >,
<Element 'tr' class=('row2',)>,
<Element 'th' >,
<Element 'td' >,
<Element 'h3' class=('level3',)>,
<Element 'p' >,
<Element 'p' >,
<Element 'div' >,
<Element 'a' href='https://www.blah.com' target='_blank'>]
If this isn't possible with CSS selectors, I'd be willing to accept an XPath solution.
CodePudding user response:
It's not entirely clear which elements you want to select. You say you want the elements "in" div.page, but that could mean either the direct children of that div or its descendants at any depth. It's also unclear whether you want to include the div element with class='table', which itself contains the table.
Anyway, I'm offering a few XPath expressions which might suit your purpose.
Here's an XPath expression that returns all the elements which are descendants of the "page" div, at any level, and which are neither a table themselves nor have a table as an ancestor.
/html/body/div[contains(@class, 'page')]
//*
[not(ancestor-or-self::table)]
It produces this result:
<h2 class="level2">This is a heading</h2>
<p>This is a paragraph.</p>
<div class="table">
<table border="1">
<tr class="row1">
<th>th 1</th>
<td>td 1</td>
</tr>
<tr class="row2">
<th>th 2</th>
<td>td 2</td>
</tr>
</table>
</div>
<h3 class="level3">a sub heading</h3>
<p>This is also a paragraph.</p>
<p>This is another paragraph.</p>
<div>This is some text in a div element.</div>
<a href="https://www.blah.com" target="_blank">Blah!</a>
Note that it returns the div that contains the table (which is why the table still appears, nested, in the output above), but it doesn't return the table as a result in its own right.
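To check this yourself, here is a minimal sketch that evaluates the expression with lxml, which requests_html builds on (calling .xpath() on your page object should behave similarly, but I haven't verified that here). The class attributes in the sample are taken from the element list in your question; the names SAMPLE, FIRST_XPATH and wrapper are mine.

```python
from lxml import html

# Sample document from the question, with class names taken from its output list.
SAMPLE = """
<html><body>
<div class="page group">
  <h2 class="level2">This is a heading</h2>
  <p>This is a paragraph.</p>
  <div class="table">
    <table border="1">
      <tr class="row1"><th>th 1</th><td>td 1</td></tr>
      <tr class="row2"><th>th 2</th><td>td 2</td></tr>
    </table>
  </div>
  <h3 class="level3">a sub heading</h3>
  <p>This is also a paragraph.</p>
  <p>This is another paragraph.</p>
  <div>This is some text in a div element.</div>
  <a href="https://www.blah.com" target="_blank">Blah!</a>
</div>
</body></html>
"""

# Descendants of the "page" div that are neither a table nor inside one.
FIRST_XPATH = (
    "/html/body/div[contains(@class, 'page')]"
    "//*"
    "[not(ancestor-or-self::table)]"
)

doc = html.fromstring(SAMPLE)
matches = doc.xpath(FIRST_XPATH)
tags = [el.tag for el in matches]
print(tags)

# The wrapper div is still matched, so its text still includes the table cells.
wrapper = next(el for el in matches if el.get("class") == "table")
print("th 1" in wrapper.text_content())
```

The table, tr, th and td elements are absent from the matches, but the wrapper div carries the table's text with it, so this expression alone won't help if you're extracting text.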
Here's another one which returns the same set of elements, but also excludes any element that has a table as a descendant. That excludes div.table.
/html/body/div[contains(@class, 'page')]
//*
[not(ancestor-or-self::table)]
[not(descendant-or-self::table)]
Producing this result:
<h2 class="level2">This is a heading</h2>
<p>This is a paragraph.</p>
<h3 class="level3">a sub heading</h3>
<p>This is also a paragraph.</p>
<p>This is another paragraph.</p>
<div>This is some text in a div element.</div>
<a href="https://www.blah.com" target="_blank">Blah!</a>
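As a rough check of that second expression, the sketch below runs it with lxml and collects the text of each match. It assumes the class names shown in the element list in your question; SAMPLE, SECOND_XPATH and texts are names I've made up for the illustration.

```python
from lxml import html

# Sample document from the question, with class names taken from its output list.
SAMPLE = """
<html><body>
<div class="page group">
  <h2 class="level2">This is a heading</h2>
  <p>This is a paragraph.</p>
  <div class="table">
    <table border="1">
      <tr class="row1"><th>th 1</th><td>td 1</td></tr>
      <tr class="row2"><th>th 2</th><td>td 2</td></tr>
    </table>
  </div>
  <h3 class="level3">a sub heading</h3>
  <p>This is also a paragraph.</p>
  <p>This is another paragraph.</p>
  <div>This is some text in a div element.</div>
  <a href="https://www.blah.com" target="_blank">Blah!</a>
</div>
</body></html>
"""

# Same as before, plus a second predicate dropping anything that contains a table.
SECOND_XPATH = (
    "/html/body/div[contains(@class, 'page')]"
    "//*"
    "[not(ancestor-or-self::table)]"
    "[not(descendant-or-self::table)]"
)

doc = html.fromstring(SAMPLE)
matches = doc.xpath(SECOND_XPATH)
tags = [el.tag for el in matches]
texts = [el.text_content().strip() for el in matches]

for tag, text in zip(tags, texts):
    print(tag, "->", text)
```

With the wrapper div filtered out as well, none of the collected text comes from the table cells.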
Here's one which returns only the children of div.page, except those that either contain, or are contained by, a table.
/html/body/div[contains(@class, 'page')]
/*
[not(ancestor-or-self::table)]
[not(descendant-or-self::table)]
When evaluated on the sample document in your question, it produces the same result as the previous expression, but it would behave differently on pages with more deeply nested elements, e.g. div elements within div elements. Because it selects only direct children, it would not return a nested element a second time, once inside its parent and again in its own right.
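The difference between the descendant and child versions only shows up with nesting, so here is a small sketch on a hypothetical document (not your sample) in which a paragraph sits inside a wrapper div. The markup and the names NESTED, PAGE and FILTERS are invented for the illustration; the evaluation uses lxml.

```python
from lxml import html

# Hypothetical page: one paragraph is nested inside a non-table wrapper div.
NESTED = """
<html><body>
<div class="page">
  <h2>Heading</h2>
  <div class="wrapper"><p>Nested paragraph.</p></div>
  <div class="table"><table><tr><td>cell</td></tr></table></div>
</div>
</body></html>
"""

PAGE = "/html/body/div[contains(@class, 'page')]"
FILTERS = "[not(ancestor-or-self::table)][not(descendant-or-self::table)]"

doc = html.fromstring(NESTED)

# "//*" walks every descendant; "/*" stops at the direct children.
descendants = doc.xpath(PAGE + "//*" + FILTERS)
children = doc.xpath(PAGE + "/*" + FILTERS)

descendant_texts = [el.text_content().strip() for el in descendants]
child_texts = [el.text_content().strip() for el in children]

print(descendant_texts)  # the nested paragraph's text appears twice
print(child_texts)       # each piece of text appears once
```

With the descendant version, the wrapper div and its paragraph are both matched, so the paragraph's text is collected twice. The child version matches only the wrapper.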
I'm guessing, of course, because I don't know exactly the purpose of the selection you're making, but I suspect this last expression is most likely the best.