How do you delete a subelement from a Scrapy Selector?

Time:11-09

I'm trying to scrape the content of some forum posts with Scrapy, and I want to exclude text that is quoted from a previous post. I'm lucky that the website marks this quoted text very clearly (it's inside "blockquote" tags), but I can't figure out how to get all the text that is not in a blockquote tag. There's an example of the forum post structure below. In this particular post, the user writes something, then quotes the previous post, then writes some more. So basically, the tag I want to get rid of is sandwiched between content that I want. More often, the quoted text comes first and the new text follows, but I need to be able to handle odd cases like this one as well.

I tried using the w3lib remove_tags:

from w3lib.html import remove_tags, remove_tags_with_content    
body = post.css('div.bbWrapper')[0]
content = remove_tags(remove_tags_with_content(body, ('blockquote', )))

but I get an error: TypeError: to_unicode must receive a bytes, str or unicode object, got Selector

I've found instructions on how to do this with Beautiful Soup, but not Scrapy. If using BS is the only option, can I just switch to it in the middle of my Scrapy parse items method?

<article ...>
<div class="bbWrapper">TEXT I WANT TO COLLECT HERE<br>
<blockquote ...>
    <div class="bbCodeBlock-title">
    <a href="/forums/goto/post?id=1053788123" ...>OTHER GUY SAID:</a>
    </div>
    <div ...>
    <div ...>
    <b>TEXT I DON'T WANT<br>
    <br>
    TEXT I DON'T WANT</b>
    </div>
     <div ...><a role="button" tabindex="0">TEXT I DON'T WANT</a></div>
     </div>
    </blockquote>
TEXT I WANT</div>
<div ...>&nbsp;</div>
<div style="margin:10px 0 10px 0;">
...
</div>
</article>

CodePudding user response:

@larapsodia, please take a tour here. It may help you solve the problem.

CodePudding user response:

First of all, in the example you gave, if I select only the text directly inside the div elements, I get:

In [1]: response.xpath('.//div/text()').getall()
Out[1]:
['TEXT I WANT TO COLLECT HERE',
 '\r\n',
 '\r\n    ',
 '\r\n    ',
 '\r\n    ',
 '\r\n    ',
 '\r\n    ',
 '\r\n     ',
 '\r\n     ',
 '\r\nTEXT I WANT',
 '\xa0',
 '\r\n...\r\n']

So you can strip the whitespace and drop the empty strings, something like this:

In [2]: [x.strip() for x in response.xpath('.//div/text()').getall() if x.strip()]
Out[2]: ['TEXT I WANT TO COLLECT HERE', 'TEXT I WANT', '...']
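If you need this clean-up in more than one place, the comprehension from In [2] can be wrapped in a small helper. This is only a sketch: the name clean_text_nodes is made up for illustration, and the input is a shortened copy of the Out[1] listing rather than a live response.

```python
def clean_text_nodes(nodes):
    """Strip each text node and drop the ones that are only whitespace."""
    stripped = (node.strip() for node in nodes)
    return [text for text in stripped if text]


# A shortened sample of the Out[1] listing above.
raw_nodes = [
    "TEXT I WANT TO COLLECT HERE",
    "\r\n",
    "\r\n    ",
    "\r\nTEXT I WANT",
    "\xa0",  # non-breaking space; str.strip() treats it as whitespace
    "\r\n...\r\n",
]

cleaned = clean_text_nodes(raw_nodes)

# Join the surviving pieces if you want the post as a single string.
post_text = "\n".join(cleaned)
```

Inside a spider you would pass the result of .getall() to the helper instead of the hard-coded list.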

Or, even better, don't select the quoted text at all: take every text node under a div, except those that have an ancestor div whose class contains "bbCodeBlock":

In [3]: response.xpath('//div//text()[not(ancestor-or-self::div[contains(@class,"bbCodeBlock")])]').getall()
Out[3]:
['TEXT I WANT TO COLLECT HERE',
 '\r\n',
 '\r\n    ',
 '\r\n    ',
 '\r\n    ',
 '\r\nTEXT I WANT',
 '\xa0',
 '\r\n...\r\n']

You can then clean up this list the same way as before: strip each item and drop the empty ones.
