Scrapy only scraping and crawling HTML and TXT


For learning purposes, I've been trying to recursively crawl and scrape all URLs on https://triniate.com/images/, but it seems that Scrapy only wants to crawl and scrape TXT, HTML, and PHP URLs.

Here is my spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from HelloScrapy.items import PageInfoItem

class HelloSpider(CrawlSpider):
    # Name used when running scrapy from the CLI
    name = 'hello'
    # Domains the spider is allowed to crawl
    allowed_domains = ["triniate.com"]
    # Starting URL for the crawl
    start_urls = ["https://triniate.com/images/"]
    # A LinkExtractor can take arguments to narrow the rule (for example,
    # scrape only pages whose URL contains "new"), but none are given here
    # because every page is targeted.
    # When a page matching the Rule is downloaded, the function named in
    # callback is called. With follow=True, the crawl continues recursively.
    rules = [Rule(LinkExtractor(), callback='parse_pageinfo', follow=True)]

    def parse_pageinfo(self, response):
        item = PageInfoItem()
        item['URL'] = response.url
        # Specify which part of the page to scrape; besides XPath,
        # CSS selectors can also be used.
        item['title'] = "idc"
        return item
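
From what I have read, LinkExtractor keeps a default deny_extensions list that filters out common file extensions (including image formats), which would explain why only HTML, TXT, and PHP URLs show up. If that is the cause, a sketch like the following (same imports as above, plus IGNORED_EXTENSIONS) might loosen the filter, though I have not verified it:

from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS

# Unverified sketch: let common image extensions through while keeping
# the rest of the default ignore list.
image_exts = ['png', 'gif', 'jpg', 'jpeg']
deny = [ext for ext in IGNORED_EXTENSIONS if ext not in image_exts]

rules = [Rule(LinkExtractor(deny_extensions=deny), callback='parse_pageinfo', follow=True)]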

My items.py contains:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


from scrapy.item import Item, Field

class PageInfoItem(Item):
    URL = Field()
    title = Field()
    pass

and the console output is:

2022-04-21 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-21 22:30:50 [scrapy.extensions.feedexport] INFO: Stored json feed (175 items) in: haxx.json
2022-04-21 22:30:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 59541,
 'downloader/request_count': 176,
 'downloader/request_method_count/GET': 176,
 'downloader/response_bytes': 227394,
 'downloader/response_count': 176,
 'downloader/response_status_count/200': 176,
 'dupefilter/filtered': 875,
 'elapsed_time_seconds': 8.711563,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 22, 3, 30, 50, 142416),
 'httpcompression/response_bytes': 402654,
 'httpcompression/response_count': 175,
 'item_scraped_count': 175,
 'log_count/DEBUG': 357,
 'log_count/INFO': 11,
 'request_depth_max': 5,
 'response_received_count': 176,
 'scheduler/dequeued': 176,
 'scheduler/dequeued/memory': 176,
 'scheduler/enqueued': 176,
 'scheduler/enqueued/memory': 176,
 'start_time': datetime.datetime(2022, 4, 22, 3, 30, 41, 430853)}
2022-04-21 22:30:50 [scrapy.core.engine] INFO: Spider closed (finished)

Could someone please suggest how I should change my code to achieve the results I described?

EDIT: To clarify, I am trying to fetch the URL, not the image or file itself.

CodePudding user response:

To do this you need to know how Scrapy works. First, write a spider that recursively crawls all the directories from the root URL, and while it visits each page, extract all the image links.

So I wrote this code for you and tested it on the website you provided. It works perfectly.

import scrapy

class ImagesSpider(scrapy.Spider):
    name = "images"
    # File extensions treated as images on this site
    image_ext = ['png', 'gif']

    # Every image link seen so far
    images_urls = set()

    def start_requests(self):
        yield scrapy.Request(url='https://triniate.com/images/', callback=self.get_images)

    def get_images(self, response):
        all_hrefs = response.css('a::attr(href)').getall()
        # Keep only the hrefs whose extension is in image_ext
        all_images_links = list(filter(lambda x: x.split('.')[-1] in self.image_ext, all_hrefs))

        for link in all_images_links:
            self.images_urls.add(link)
            yield {'link': f'{response.request.url}{link}'}

        # Directory listings end with "/", so follow them recursively
        next_page_links = list(filter(lambda x: x[-1] == '/', all_hrefs))
        for link in next_page_links:
            yield response.follow(link, callback=self.get_images)

This way you get the links of all the images on this page and in any nested directories (recursively).

The get_images method searches for images on the page. It collects all the image links and then queues any directory links to crawl afterwards, so it ends up with the image links from every directory.

The code I provided produces the following output, which contains all the links you want:

[
   {"link": "https://triniate.com/images/ChatIcon.png"},
   {"link": "https://triniate.com/images/Sprite1.gif"},
   {"link": "https://triniate.com/images/a.png"},
   ...
   ...
   ...
   {"link": "https://triniate.com/images/objects/house_objects/workbench.png"}
]
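
For reference, since the spider is self-contained, you should be able to reproduce this output with something along these lines (the file names are just examples):

scrapy runspider images_spider.py -o image_links.json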

Note: I specified the image file extensions in the image_ext attribute. You can extend it to cover every available image extension, or include only the extensions that actually appear on the website, as I did. Your choice.
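
If you would rather not hard-code the list, one option (just a sketch using the standard library, not something the spider above needs) is to derive it from Python's MIME registry:

import mimetypes

# Every extension the standard library maps to an image/* MIME type,
# stripped of the leading dot (e.g. 'png', 'gif', 'jpeg', 'svg').
image_ext = sorted(ext.lstrip('.') for ext, mime in mimetypes.types_map.items()
                   if mime.startswith('image/'))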

CodePudding user response:

I tried it using a basic spider along with scrapy-selenium, and it works.

basic.py

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['triniate.com']

    def start_requests(self):
        # Render the directory listing with Selenium, collect the links,
        # then hand each URL back to Scrapy as a SeleniumRequest.
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
        driver.set_window_size(1920, 1080)
        driver.get("https://triniate.com/images/")

        links = driver.find_elements(By.XPATH, "//html/body/table/tbody/tr/td[2]/a")

        for link in links:
            href = link.get_attribute('href')
            yield SeleniumRequest(
                url=href
            )

        driver.quit()

    def parse(self, response):
        yield {
            'URL': response.url
        }

settings.py

added:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
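
One more note: scrapy_selenium's middleware also reads its own driver settings. If they are not already in your settings.py, something along these lines should work (the executable path is a placeholder for wherever chromedriver lives on your machine):

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/path/to/chromedriver'  # placeholder path
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

With those in place, running scrapy crawl basic -o urls.json from the project directory writes the scraped URLs to a JSON feed.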

output

2022-04-22 12:03:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/stand_right.gif>
{'URL': 'https://triniate.com/images/stand_right.gif'}
2022-04-22 12:03:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://triniate.com/images/walk_right_transparent.gif> (referer: None)
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_back.gif>
{'URL': 'https://triniate.com/images/walk_back.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_left_transparent.gif>
{'URL': 'https://triniate.com/images/walk_left_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_front_transparent.gif>
{'URL': 'https://triniate.com/images/walk_front_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_back_transparent.gif>
{'URL': 'https://triniate.com/images/walk_back_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_right.gif>
{'URL': 'https://triniate.com/images/walk_right.gif'}
2022-04-22 12:03:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://triniate.com/images/walk_right_transparent.gif>
{'URL': 'https://triniate.com/images/walk_right_transparent.gif'}
2022-04-22 12:03:52 [scrapy.core.engine] INFO: Closing spider (finished)