select all text nodes inside an element without text in child elements

Time:06-27

While scraping a site, I have HTML like this:

<div >
  <div >
    <h1 >Text I don't want</h1>
    <ul>....</ul> <!-- containing more text in nested children, don't want -->
  </div>
  Text I want to grab.
  <br>
  More text I want to grab
</div>

How can I select only the text I want to grab, i.e. ["Text I want to grab", "More text I want to grab"], without also selecting "Text I don't want"? I am currently trying a CSS selector like this:

text = response.css('.classA:not(.classD) *::text').getall()

Does anyone know what to do in this case? I am not familiar with XPath, but please suggest an XPath solution if you have one.

CodePudding user response:

You are almost there. Using :not is fine, but the selector has to start from the element that directly contains the text you want, which is the outer <div >. Note that :not(.classF) only filters the div elements themselves. What actually keeps the nested text out is applying ::text directly to that div instead of to its descendants (' *::text'), because ::text on an element returns only that element's own text nodes. So the CSS expression should be:

response.css('div.classA.classB.classC:not(.classF)::text').getall()

Or, to join the pieces into a single string:

' '.join([x.strip() for x in response.css('div.classA.classB.classC:not(.classF)::text').getall()])

Verified in the Scrapy shell:

In [1]: from scrapy.selector import Selector

In [2]: %paste

html='''
<div >
  <div >
    <h1 >Text I don't want</h1>
    <ul>....</ul> <!-- containing more text in nested children, don't want -->
  </div>
  Text I want to grab.
  <br>
  More text I want to grab
</div>
'''

## -- End pasted text --

In [3]: resp=Selector(text=html)

In [4]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip()
Out[4]: 'Text I want to grab.\n  \n  More text I want to grab'

In [5]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).replace('\n','').strip()
Out[5]: 'Text I want to grab.    More text I want to grab'

In [6]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip().replace('\n','').strip()
Out[6]: 'Text I want to grab.    More text I want to grab'

In [7]: [x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()]
Out[7]: ['', 'Text I want to grab.', 'More text I want to grab']

In [8]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[8]: 'Text I want to grab.More text I want to grab'

In [9]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[9]: 'Text I want to grab.More text I want to grab'

In [10]: ' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])        
Out[10]: ' Text I want to grab. More text I want to grab'
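
The outputs above still carry an empty string (Out[7]) or a leading space (Out[10]), because the whitespace-only text node before the inner div is kept. A small sketch of one way to clean that up is shown below. The helper name clean_text_nodes is hypothetical, and the raw strings are assumed values shaped like what .getall() returns for this markup.

```python
def clean_text_nodes(nodes):
    """Strip each text node and drop the ones that were whitespace only."""
    stripped = (node.strip() for node in nodes)
    return [text for text in stripped if text]


# Assumed input: strings shaped like the text nodes in the transcript above.
raw_nodes = [
    "\n  ",
    "\n  Text I want to grab.\n  ",
    "\n  More text I want to grab\n",
]

texts = clean_text_nodes(raw_nodes)
print(texts)            # list form, as asked for in the question
print(" ".join(texts))  # single string without the leading space
```

In a spider this would be called as clean_text_nodes(response.css('div.classA.classB.classC:not(.classF)::text').getall()).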