How to scrape all visible text but exclude hyperlink text?

I am interested in all of the visible text of a website.

However, I would like to exclude hyperlink text. That way I can also exclude the text in menu bars, because menus consist mostly of links. On the page below, for example, everything in the menu bar could be excluded (e.g. "Wohnen & Bauen").

https://www.gross-gerau.de/Bürger-Service/Ver-und-Entsorgung/Abfallinformationen/index.php?object=tx,2289.12976.1&NavID=3411.60&La=1

My spider currently looks like this:

import time

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# NOTE: deny_list_sm is not shown here; it is assumed to be defined elsewhere.


class MySpider(CrawlSpider):
    name = 'my_spider'

    start_urls = ['https://www.gross-gerau.de/Bürger-Service/Wohnen-Bauen/']

    # CrawlSpider uses parse() internally to apply the rules,
    # so the callback needs a different name.
    rules = (
        Rule(LinkExtractor(allow="Bürger-Service", deny=deny_list_sm),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['scrape_date'] = int(time.time())
        item['response_url'] = response.url

        # old approach
        # item["text"] = " ".join([x.strip() for x in response.xpath("//text()").getall()]).strip()
        # exclude at least JavaScript snippets and the like
        item["text"] = " ".join([x.strip() for x in response.xpath("//*[name(.)!='head' and name(.)!='script']/text()").getall()]).strip()

        yield item

The solution should work for other websites, too. Does anyone have an idea how to solve this? Any ideas are welcome!

CodePudding user response:

You can extend your predicate so that it also skips a elements:

[name()!='head' and name()!='script' and name()!='a']
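
One caveat: a name() test on the parent only filters text that sits directly inside an a element. Text nested deeper in a link, such as <a><span>Menu</span></a>, would still be collected. In XPath this can probably be handled with an ancestor test, for example //body//text()[not(ancestor::a or ancestor::script or ancestor::style)]. The same ancestor-based filtering is sketched below with only the Python standard library, so it can be tried without Scrapy. The names SKIPPED_TAGS, LinkFreeTextParser and text_without_links are hypothetical, and the set of skipped tags is an assumption to adjust per site.

```python
from html.parser import HTMLParser

# Tags whose text is skipped at any nesting depth (an assumption; adjust as needed).
SKIPPED_TAGS = {"a", "script", "style", "head"}


class LinkFreeTextParser(HTMLParser):
    """Collects text that has no open ancestor in SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # number of currently open skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep the text only when no skipped tag is currently open.
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def text_without_links(html):
    parser = LinkFreeTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Title</title></head><body>"
    "<nav><a href='/x'><span>Wohnen &amp; Bauen</span></a></nav>"
    "<p>Waste collection dates <a href='/y'>more</a> are listed here.</p>"
    "<script>var x = 1;</script>"
    "</body></html>"
)
print(text_without_links(sample))  # Waste collection dates are listed here.
```

In the spider, something like item["text"] = text_without_links(response.text) would be the equivalent call. Because the sketch relies on closing tags to end a skipped region, badly malformed HTML (an a element that is never closed) may hide more text than intended.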