While scraping a site, I have HTML like this:
<div >
<div >
<h1 >Text I don't want</h1>
<ul>....</ul> <!-- containing more text in nested children, don't want -->
</div>
Text I want to grab.
<br>
More text I want to grab
</div>
How can I select only the text I want to grab, i.e. ["Text I want to grab", "More text I want to grab"], without also selecting "Text I don't want"? I am trying a CSS selector like this:
text = response.css('.classA:not(.classD) *::text').getall()
Does anyone know what to do in this case? I am not familiar with XPath, but please suggest an XPath solution if you have one.
CodePudding user response:
You are close to your goal. Using :not to exclude <h1 >Text I don't want</h1> is the right idea, but first you have to select the whole element that contains your desired output, i.e. the outer <div >, and only then exclude what you don't want. Note that ::text applied directly to that div (without the " *" descendant part) returns only the div's own text nodes, not the text of nested elements. So the CSS expression should be:
response.css('div.classA.classB.classC:not(.classF)::text').getall()
OR
' '.join([x.strip() for x in response.css('div.classA.classB.classC:not(.classF)::text').getall()])
Verified in the Scrapy shell:
In [1]: from scrapy.selector import Selector
In [2]: %paste
html='''
<div >
<div >
<h1 >Text I don't want</h1>
<ul>....</ul> <!-- containing more text in nested children, don't want -->
</div>
Text I want to grab.
<br>
More text I want to grab
</div>
'''
## -- End pasted text --
In [3]: resp=Selector(text=html)
In [4]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip()
Out[4]: 'Text I want to grab.\n \n More text I want to grab'
In [5]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).replace('\n','').strip()
Out[5]: 'Text I want to grab. More text I want to grab'
In [6]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip().replace('\n','').strip()
Out[6]: 'Text I want to grab. More text I want to grab'
In [7]: [x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()]
Out[7]: ['', 'Text I want to grab.', 'More text I want to grab']
In [8]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[8]: 'Text I want to grab.More text I want to grab'
In [9]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[9]: 'Text I want to grab.More text I want to grab'
In [10]: ' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[10]: ' Text I want to grab. More text I want to grab'
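The question asks for a list rather than one joined string, and the leading space in the last output comes from a whitespace-only text node. A small helper like the one below can handle both. This is only a sketch: `clean_text_nodes` is a hypothetical name, and the sample `raw_nodes` list is an assumed stand-in for what `getall()` returns (the exact whitespace will differ on your page).

```python
def clean_text_nodes(nodes):
    """Trim each text node and drop the ones that contain only whitespace."""
    stripped = (node.strip() for node in nodes)
    return [text for text in stripped if text]


# ASSUMPTION: sample input shaped like a raw getall() result; in a spider you
# would pass response.css('...::text').getall() instead.
raw_nodes = [
    "\n  ",
    "\n  Text I want to grab.\n  ",
    "\n  More text I want to grab\n",
]

parts = clean_text_nodes(raw_nodes)
print(parts)            # ['Text I want to grab.', 'More text I want to grab']
print(" ".join(parts))  # Text I want to grab. More text I want to grab
```

Filtering out the empty strings before joining avoids the stray leading space.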