<p>
A
<br>
<br>
B
<a ...>
<span >C</span>
</a>
D
<a ...>
<span >E</span>
</a>
F
</p>
I want to get the result "ABCDEF".
I know that xpath('text()').getall()
returns "A", "B", "D" and "F",
and that xpath('./*')
returns the elements containing "C" and "E".
But then I lose the original order of the pieces. How should I do this?
CodePudding user response:
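The expression txt = ''.join([ x.get().strip() for x in response.xpath('//p//text()')])
will produce the value "ABCDEF". The XPath //p//text() selects every text node under <p>, including those nested inside <a> and <span>, in document order.
Demonstrated in the Scrapy shell: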
In [1]: from scrapy.selector import Selector
In [2]: %paste
html = '''
<p>
A
<br>
<br>
B
<a ...>
<span >C</span>
</a>
D
<a ...>
<span >E</span>
</a>
F
</p>
'''
## -- End pasted text --
In [3]: res= Selector(text=html)
In [4]: res.xpath('//p//text()').getall()
Out[4]:
['\n A\n ',
'\n ',
'\n B\n ',
'\n ',
'C',
'\n ',
'\n D\n ',
'\n ',
'E',
'\n ',
'\n F\n']
In [5]: txt = [ x.get().strip() for x in res.xpath('//p//text()')]
In [6]: txt
Out[6]: ['A', '', 'B', '', 'C', '', 'D', '', 'E', '', 'F']
In [7]: txt = ''.join([ x.get().strip() for x in res.xpath('//p//text()')])
In [8]: txt
Out[8]: 'ABCDEF'
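The stripping and joining step from the session above can also be written as a small helper. This is only a sketch: join_text_nodes is a hypothetical name, not part of Scrapy, and the input list is copied from Out[4] so the snippet runs without Scrapy installed. Since getall() already returns plain strings, there is no need to call get() on each selector.

```python
def join_text_nodes(fragments):
    """Strip each text fragment and concatenate them, keeping their order."""
    return "".join(fragment.strip() for fragment in fragments)


# These fragments are copied from Out[4] above. In a spider they would
# come from response.xpath('//p//text()').getall().
fragments = [
    "\n A\n ", "\n ", "\n B\n ", "\n ", "C", "\n ",
    "\n D\n ", "\n ", "E", "\n ", "\n F\n",
]

print(join_text_nodes(fragments))  # ABCDEF
```

If the whitespace-only nodes are unwanted in the first place, the XPath //p//text()[normalize-space()] should filter them out before they reach Python.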
CodePudding user response:
response.xpath('//p//text()')
or
response.css('p ::text')
Both should work. Check this answer for more clarity. Also, if you are using Python, getall() already returns a list in document order, so the order of the elements you are extracting is preserved.