(Scrapy) How do you scrape all the external links on each website from a list of hundreds of websites


I am looking for some help regarding my Scrapy project. I want to use Scrapy to code a generic spider that would crawl multiple websites from a list. I was hoping to keep the list in a separate file, because it's quite large. For each website, the spider will navigate through internal links and, on each page, collect every external link.

I believe there are too many websites to create one spider per website. I want to scrape only external links, meaning "absolute" links whose domain name is different from the domain of the website where the link is found (subdomains would still be internal links from my POV).
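
To illustrate what I mean by "external", a rough check could look like this (the helper name and the plain suffix comparison are just for illustration):

from urllib.parse import urlparse


def is_external(link_url, site_domain):
    """Rough check: True if link_url points outside site_domain.

    site_domain is e.g. 'loremipsum.io'; subdomains such as
    'blog.loremipsum.io' are still treated as internal.
    """
    netloc = urlparse(link_url).netloc.lower().split(':')[0]  # drop any port
    return not (netloc == site_domain or netloc.endswith('.' + site_domain))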

Eventually, I want to export the results in a CSV with the following fields:

  • domain of the website being crawled (from the list),
  • page_url (where the external link was found),
  • external_link. If the same external link is found several times on the same page, it is deduped (see the sketch after this list). I am not sure yet, but I might want to dedup external links at the website scope too, at some point.
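
To make the target output concrete, here is a rough sketch of the kind of callback and CSV export I have in mind (the spider name, start URL, output file name and the crude suffix check are placeholders, and deduplication here is only per page):

import scrapy
from urllib.parse import urlparse
from scrapy.linkextractors import LinkExtractor


class ExternalLinksSketch(scrapy.Spider):
    # Placeholder name and start URL, for illustration only
    name = 'external_links_sketch'
    start_urls = ['https://loremipsum.io/']
    # Scrapy's FEEDS setting writes the yielded items straight to a CSV file
    custom_settings = {'FEEDS': {'external_links.csv': {'format': 'csv'}}}

    def parse(self, response):
        site_domain = urlparse(response.url).netloc   # domain of the website being crawled
        seen_on_page = set()                          # dedup external links per page
        for link in LinkExtractor(unique=True).extract_links(response):
            link_domain = urlparse(link.url).netloc
            if not link_domain.endswith(site_domain) and link.url not in seen_on_page:
                seen_on_page.add(link.url)
                yield {
                    'domain': site_domain,
                    'page_url': response.url,
                    'external_link': link.url,
                }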

At some point, I would also like to:

  • filter out certain external links so they are not considered, such as facebook.com/... etc. (see the sketch after this list),
  • run the script from Zyte.com. I believe it constrains me to follow a certain code structure, rather than just a standalone script. Any suggestion on that aspect would really help too.
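
From what I read, LinkExtractor's deny_domains argument might cover the filtering part; something like this (the blocklist below is just an example):

from scrapy.linkextractors import LinkExtractor

# Example blocklist; any extracted link whose domain matches an entry is ignored
DENY_DOMAINS = ['facebook.com', 'twitter.com', 'linkedin.com']

link_extractor = LinkExtractor(unique=True, deny_domains=DENY_DOMAINS)
# later, inside a callback: link_extractor.extract_links(response)

As for Zyte, my understanding is that Scrapy Cloud expects a regular Scrapy project (scrapy.cfg plus a spiders package) deployed with the shub tool, rather than a standalone script, but I would appreciate confirmation on that.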

After a lot of research, I found this reference: https://coderedirect.com/questions/369975/dynamic-rules-based-on-start-urls-for-scrapy-crawlspider

But it wasn't clear to me how to make it work, because it's missing a full version of the code.

So far, the code I developed is as below, but I am stuck, as it does not fulfill my needs:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request
# import the link extractor
from scrapy.linkextractors import LinkExtractor
import os


class LinksSpider(scrapy.Spider):
    name = 'publishers_websites'
    start_urls = ['https://loremipsum.io/']
    allowed_domains = ['loremipsum.io']
    # Keep all settings in a single dict: a second custom_settings assignment
    # would silently overwrite the first one (and drop the user agent)
    custom_settings = {
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True
    }

    # Start from a clean output file on each run
    try:
        os.remove('publishers_websites.txt')
    except OSError:
        pass

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        domain = 'https://loremipsum.io/'
        all_links = self.link_extractor.extract_links(response)
        for link in all_links:
            # Links whose URL does not contain the site's domain are treated as external
            if domain not in link.url:
                with open('publishers_websites.txt', 'a') as f:
                    f.write(f"\n{response.request.url}, {link.url}")

            # Follow every extracted link; allowed_domains keeps the crawl on-site
            yield response.follow(url=link, callback=self.parse)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LinksSpider)
    process.start()

There aren't many answers to my problem, and my Python skills are not good enough to solve it by myself.

I would be very grateful for any help I receive.

CodePudding user response:

When you want to crawl a list of links, you have to pass them via the start_urls attribute.

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Notice the start_urls list. It automatically starts the crawl; there is no need to tell Scrapy to read each URL. There is another method, in which you set the URLs in the start_requests method:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

Nevertheless, if your list of links is very big, you can load it into a DataFrame with pandas and loop over it, in case you do not want to write the links out as a Python list.
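
For example (the file name and column name are just placeholders):

import pandas as pd

# Assuming a CSV file with one URL per row in a 'link' column
df = pd.read_csv('websites.csv')
start_urls = df['link'].dropna().tolist()  # feed this list to the spider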

Cheers

CodePudding user response:

Please read about CrawlSpider and rules.

For example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class example(CrawlSpider):
    name = "example_spider"
    start_urls = ['https://example.com']
    rules = (Rule(LinkExtractor(), callback='parse_urls', follow=True),)

    def parse_urls(self, response):
        for url in response.xpath('//a/@href').getall():
            if url:
                yield {
                    'url': url
                }

Maybe you'll want to add a function to check whether a URL is valid, or to resolve relative URLs into absolute ones. But generally speaking, this example should work.
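
For instance, a small helper along these lines could do both checks (the name is just for illustration):

from urllib.parse import urlparse


def clean_url(href, response):
    """Resolve href against the page URL and keep only http(s) links.

    Returns an absolute URL, or None for things like 'mailto:' or
    'javascript:' links.
    """
    absolute = response.urljoin(href)  # turns relative urls into full urls
    parsed = urlparse(absolute)
    return absolute if parsed.scheme in ('http', 'https') and parsed.netloc else None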

(Just create an __init__ function to load your file into start_urls, plus anything else you want to add.)

(And I don't know anything about zyte...)

Edit1:

You can also use another link extractor inside 'parse_urls' if that is more comfortable for you.
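
For example, a drop-in replacement for the callback above could look like this (just a sketch; the extractor already returns absolute, deduplicated links):

from scrapy.linkextractors import LinkExtractor


def parse_urls(self, response):
    # Second LinkExtractor instead of the raw XPath loop;
    # it yields absolute, deduplicated Link objects
    for link in LinkExtractor(unique=True).extract_links(response):
        yield {'url': link.url}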

Edit2:

As for getting the URLs from a file, you can do it in the __init__ function:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class example(CrawlSpider):
    name = "example_spider"

    def __init__(self, *args, **kwargs):
        self.rules = (Rule(LinkExtractor(allow_domains=['example.com']), callback='parse_urls', follow=True),)
        with open('urlsfile.txt', 'r') as f:
            self.start_urls = [line.strip() for line in f.readlines()]
        super(example, self).__init__(*args, **kwargs)

    def parse_urls(self, response):
        for url in response.xpath('//a/@href').getall():
            if url:
                yield {
                    'url': url
                }

CodePudding user response:

import scrapy
import pandas as pd


class Url_Spider(scrapy.Spider):
    name = 'url_page'

    def start_requests(self):
        df = pd.read_csv('list.csv')

        urls = df['link']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """Parse here whatever you need."""
        pass