How to extract plain text and <a> tags that are on the same level?

Time:07-07

<p>
    A
    <br>
    <br>
    B
    <a ...>
        <span >C</span>
    </a>
    D
    <a ...>
        <span >E</span>
    </a>
    F
</p>
    

I want to get the result "ABCDEF".

I know xpath('text()').getall() returns "A", "B", "D", "F",

and xpath('./*') returns the elements that contain "C" and "E".

But then I lose the original order of the pieces. How should I do this?

CodePudding user response:

The XPath expression //p//text() selects every text node under <p>, at any depth, in document order. So txt = ''.join([ x.get().strip() for x in response.xpath('//p//text()')]) yields the value "ABCDEF".

Verified in the Scrapy shell:

In [1]: from scrapy.selector import Selector

In [2]: %paste
html = '''
<p>
    A
    <br>
    <br>
    B
    <a ...>
        <span >C</span>
    </a>
    D
    <a ...>
        <span >E</span>
    </a>
    F
</p>
'''

## -- End pasted text --

In [3]: res= Selector(text=html)

In [4]: res.xpath('//p//text()').getall()
Out[4]: 
['\n    A\n    ',
 '\n    ',       
 '\n    B\n    ',
 '\n        ',   
 'C',
 '\n    ',       
 '\n    D\n    ',
 '\n        ',   
 'E',
 '\n    ',
 '\n    F\n']

In [5]: txt = [ x.get().strip() for x in res.xpath('//p//text()')]

In [6]: txt
Out[6]: ['A', '', 'B', '', 'C', '', 'D', '', 'E', '', 'F']

In [7]: txt = ''.join([ x.get().strip() for x in res.xpath('//p//text()')])

In [8]: txt
Out[8]: 'ABCDEF'

CodePudding user response:

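xpath('//p//text()')

or

css('p ::text')

Both should work: each returns every text node under <p>, including the ones nested inside the <a> tags. Note that '/p/text()' would match only a root-level <p> and only its direct text children, and that '::text' is a CSS pseudo-element, so it belongs in css() rather than xpath(). Check this answer for more clarity. Also, if you are using Python, keep the extracted elements in a list to maintain their order (getall() already returns one, in document order).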
