I tried this code to get the HTML content of the element div.entry-content:
response.css('div.entry-content').get()
However, it returns the wrapping element too:
<div >
<p>**my content**</p>
<p>more content</p>
</div>
But I want just the contents, so in my case: <p>**my content**</p><p>more content</p>
I also tried an XPath selector, response.xpath('//div[@]').get(), but with the same result as above.
Based on F.Hoque's answer below I tried:
response.xpath('//article/div[@]//p/text()').getall()
and response.xpath('//article/div[@]//p').getall()
These, however, return arrays: the text content of each p element and all p elements, respectively. What I want is the HTML contents of the div.entry-content element as a single value, without the wrapping element itself.
I've tried Googling, but can't find anything.
CodePudding user response:
As you said, your main div contains multiple p tags and you want to extract the text node value from those p tags. //p will select all the p tags.
response.xpath('//div[@]//p').getall()
The following expression joins the results into a single string instead of a list:
p_tags = ''.join([x.get() for x in response.xpath('//article/div[@]//p')])
CodePudding user response:
Your content is in the <p> tag, not the <div>:
response.css('div.entry-content p').get()
or
response.xpath('//div[@]/p').get()