Unable to scrape prices data from Amazon.in using scrapy on python

Time:08-31

Below is the code and output. I have tried looking up the issue, but the closest matches I could find are these links:

https://stackoverflow.com/questions/70444479/scrapy-not-able-to-scrape-ratings-data-on-amazon
https://www.quora.com/Why-cant-Amazon-prices-be-scraped

Any help is appreciated. Thank you.

Spider Code

import json

import scrapy
from ..items import AmazontutorialItem

class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonSpider'
    allowed_domains = ['amazon.in']
    start_urls = \
    [
        'https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031'
    ]

    def parse(self, response):
        items = AmazontutorialItem()

        productName = response.xpath("//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'a-link-normal', ' ' ))]//span//div/text()").extract()
        productAuthor = response.css('.a-color-base ._cDEzb_p13n-sc-css-line-clamp-1_1Fn1y').css('::text').extract()
        productPrice = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_cDEzb_p13n-sc-price_3mJ9Z", " " ))]/text()').extract()
        productImageLink = response.css('.p13n-product-image::attr(src)').extract()

        items['productName'] = productName
        items['productAuthor'] = productAuthor
        items['productPrice'] = productPrice
        items['productImageLink'] = productImageLink

        yield items

Terminal Output

2022-08-31 14:02:16 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: amazonTutorial)
2022-08-31 14:02:16 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.22000-SP0
2022-08-31 14:02:16 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'amazonTutorial',
 'NEWSPIDER_MODULE': 'amazonTutorial.spiders',
 'SPIDER_MODULES': ['amazonTutorial.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; '
               '+http://www.google.com/bot.html)'}
2022-08-31 14:02:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-31 14:02:16 [scrapy.extensions.telnet] INFO: Telnet Password: 6adf5086c628832a
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-31 14:02:17 [scrapy.core.engine] INFO: Spider opened
2022-08-31 14:02:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-31 14:02:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-31 14:02:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031> (referer: None)
2022-08-31 14:02:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031>
{'productAuthor': ['ACTIVISION',
                   'Electronic Arts',
                   'Generic',
                   'Electronic Arts',
                   'ACTIVISION',
                   'ROCKSTAR GAMES',
                   'REES52',
                   'ROCKSTAR GAMES',
                   'Microsoft',
                   'ACTIVISION',
                   'Generic',
                   'Generic',
                   'Square Enix',
                   'UBI Soft',
                   'Rockstar North',
                   'Paradox Interactive',
                   'Generic',
                   'Generic',
                   'ADGAMES',
                   'Blizzard Entertainment',
                   'UBI Soft',
                   'Generic',
                   'Valve',
                   'Generic',
                   'Excalibur by Unlimited',
                   'UBI Soft',
                   'SEGA',
                   'ADGAMES',
                   'ACTIVISION',
                   'Generic',
                   'AD Games',
                   'Bluehole Studio Inc., PUBG Corporation',
                   'UBI Soft',
                   'Bethesda',
                   'Eidos',
                   'AG Gaming',
                   'ACTIVISION',
                   'ACTIVISION',
                   'ROCKSTAR GAMES',
                   'Generic',
                   'Generic',
                   'ADGAMES',
                   'Warner Bros.',
                   'Generic',
                   'UBI Soft',
                   '2K GAMES',
                   'Bethesda',
                   'Codemasters'],
 'productImageLink': ['https://images-eu.ssl-images-amazon.com/images/I/91Wjtmyrg9L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81IXtVuvlmL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/714zMHvejkL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81yegjdGUjL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51KuZ6TnmfL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/71mNkKmd3JL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/41SI-1pARKL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51BOMq+7w7L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51djUfKMJyL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81yhTa3zjlL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/819Nhgz+3NL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/61DSfTeIAdS._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81r70W2EVRL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/516v7vChU8L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81PR8qtHJJL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51VV5Z8M5KL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/91wL7h6OX6L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/31lL8a0n17L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51T43LR-tlL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81kTM28TXpL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/A1iEiu4PEJL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51ewfEKk2vL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/510G-36LZWL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81L8-mjNlrL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/61bDL5UUuFL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/71nrt+t8bAL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51eu5RaeIvL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81beRvbvv1L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51rlW7AK2xL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51kYpa4lksL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51tgtEXNi9L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/811qcvGij2L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/91vFfJh2IbL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/71H7c4DPQEL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81qTUih-eUL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/41afQxgahrL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81ViUDBvP+L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51gZP1Yh1nL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/61y3yx53X2L._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51GhmLgOLPL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/615H16JHVSL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51kXWEFYqPL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/41zuts4V8EL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81Xa2jR5ApL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/813H5aEHrdL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/31WujBtNSbL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/71LZ7amiLOL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/71S9QD541VL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/51SUf67G2QL._AC_UL300_SR300,200_.jpg',
                      'https://images-eu.ssl-images-amazon.com/images/I/81LXTC16qBL._AC_UL300_SR300,200_.jpg'],
 'productName': ['ACTIVISION Call of Duty: WWII (PS4)',
                 'FIFA 23 for PC',
                 'GTA (Combo) PC Game - Digital Download (No Online '
                 'Multiplayer/No REDEEM* Code) - | NO DVD NO CD |',
                 'FIFA 22 (PC)',
                 'Spider-Man: Friend or Foe (PC DVD)',
                 'PC GTA 4 (PC)',
                 'Waveshare Handheld Game HAT for Raspberry Pi '
                 '4B/3B /3B/2B/B /A /Zero/Zero W Portable Game Console Gameboy '
                 '3.5inch IPS Screen on Board Gamepad Joystick, Smoothly '
                 'Display',
                 'Grand Theft Auto: Episodes From Liberty City (PC)',
                 'Age of Empires IV: Standard - Windows 10 (Digital Code) (PC)',
                 'Call of Duty: WWII (Xbox One)',
                 'Project IGI (2000) Offline PC Game',
                 'G-T-A-San-Andrea - (Digital Download) Full PC Game - (NO DVD '
                 'NO CD) - (NO ONLINE MULTIPLAYER MODE) - PC.',
                 'Sleeping Dogs - Definitive Edition',
                 "Tom Clancy's Rainbow Six Siege (PC)",
                 'Grand Theft Auto V - PC - (ROCKSTAR SOCIAL CLUB DOWNLOAD '
                 'CODE-NO CD/DVD)',
                 'Take Command: Second Manassas',
                 'Trick (pc)',
                 'Hitman 2: Silent Assassin (2002) Offline PC Game',
                 'EPC Games: AOE (1,2 & 3) (Digital Download) No DVD/CD (No '
                 'Online Multiplayer) - Single Player Mode (PC Game)',
                 'Assassin Creed III Pc Game DVD For Windows Full Setup '
                 'Offline',
                 'World of Warcraft (PC/Mac)',
                 "Assassin's Creed IV: Black Flag (PS4)",
                 'GTA-San Andraes (PC GAME) - PC Download - [No Multiplayer/No '
                 'Redeem* Code] - | *NO DVD NO CD* | - WIN 10/11',
                 'Counter-strike: Global Offensive (PC)',
                 'Empire Earth (2001) Offline PC Game',
                 'Train Simulator 2015 (PC)',
                 "Tom Clancy's: Rainbow Six Siege (Free PS5 Upgrade)",
                 'Total War : Three Kingdoms Royal Edition (PC)',
                 'Total_OverDose Pc Game Dvd (Windows)',
                 'Call of Duty: Black Ops II (PS3)',
                 'Generic SpideMan 3 PC GAME- (Digital Download) - [ NO DVD NO '
                 'CD - NO ONLINE MULTIPLAYER/NO ACTIVATION CODE* ] - PC',
                 'JUST_CUSE_2 PC GAME DVD',
                 "Player Unknown's Battle Grounds -PUBG (Code in the Box)",
                 'Driver 3/Driver: Parallel Lines (PC)',
                 'Fallout 3 - Game of the Year Edition (PC Code)',
                 'SEKEIRO: SHADOWS DIE TWICE – GOTY EDITION (PC GAME) - '
                 'Digital Download (No Online Multiplayer/No REDEEM* Code) - | '
                 'NO DVD NO CD |',
                 'Kane and Lynch 2: Dog Days (PC DVD)',
                 'GA Retails - Battlefield Hardline Action Adventure Standard '
                 'Edition Offline PC Game (for PC)',
                 'ACTIVISION Call of Duty: WWII (PS4) UBI Soft Far Cry 5 (PS4)',
                 'Activision Blizzard Inc Call of Duty WWII - PlayStation 4 '
                 'Standard Edition',
                 'Grand Theft Auto: Vice City (PC)',
                 'Commandos: Behind Enemy Lines (1998) Offline PC Game',
                 'Command & Conquer: Generals – Zero Hour (2003) Offline PC '
                 'Game',
                 'Assassin Creed 4 Black Flag Pc Game DVD For Windows',
                 'Fear 2: Project Origin (PC)',
                 'Delta Force 2 (1999) Offline PC Game',
                 "Tom Clancy's: The Division (PS4)",
                 'The Bureau Xcom Declassified (PC)',
                 'Prey - PlayStation 4',
                 'F1: 2013 (PC)'],
 'productPrice': []}
2022-08-31 14:02:19 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-31 14:02:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 325,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 53976,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.822063,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 31, 8, 32, 19, 119698),
 'httpcompression/response_bytes': 315618,
 'httpcompression/response_count': 1,
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 8, 31, 8, 32, 17, 297635)}
2022-08-31 14:02:19 [scrapy.core.engine] INFO: Spider closed (finished)

CodePudding user response:

Your XPath expression appears to be correct; the problem seems to come from the cookies and the user-agent. If you send a browser user-agent and cookies as request headers, it should work.
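As a possible variation (a sketch under assumptions, not part of the original answer): `scrapy.Request` also accepts a `cookies` argument that takes a dict, so a cookie string copied from the browser's developer tools can be split into name/value pairs instead of being sent as a raw header. The helper name `cookie_header_to_dict` and the sample cookie values are hypothetical.

```python
def cookie_header_to_dict(header: str) -> dict[str, str]:
    """Split a raw 'name=value; name2=value2' cookie string into a dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


# Placeholder values; use the cookies from your own browser session.
raw_cookie = "i18n-prefs=INR; session-id=000-0000000-0000000"
cookies = cookie_header_to_dict(raw_cookie)
print(cookies)

# Inside a spider's start_requests, the dict could then be passed as:
#   yield scrapy.Request(url=url, cookies=cookies,
#                        headers={'user-agent': '...'}, callback=self.parse)
```

Whether the page returns prices with a given set of cookies still has to be checked against the live site; session cookies expire, so hard-coded values only work for a limited time.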

Example:

import scrapy
#from ..items import AmazontutorialItem

class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonSpider'
    
    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031'
        headers = {
            'cookie': 'session-id=261-0423817-4284533; i18n-prefs=INR; ubid-acbin=260-9536726-4137347; csm-hit=tb:A66BGQMC56131P7KZ9YG s-A66BGQMC56131P7KZ9YG|1661938038725&t:1661938038725&adb:adblk_yes; session-token=cjJGeI4cpAcgaeDeO1s0KCC/G7QgNmdopz0rJ36VnXSj0STq5jsO91q2WNtml5LSD5wDG9wcSlfvPhI6WODNbkLHB 6 SQuH5S9tmWavNapCmLU2AG3Hgiw1Wddq9cbv0dXRTFgyXEEo02ivmXUTNvs5PSNQTRVOhGdVy1Z6gjfmJhfxbY WWNLeuCwfTU3DqW4pw cZwUS97Q0CU2axkSVijOpXybC/; session-id-time=2082758401l',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
        }
        
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse,
        )
        
    def parse(self, response):
        items = {} #AmazontutorialItem()

        productName = response.xpath("//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'a-link-normal', ' ' ))]//span//div/text()").extract()
        productAuthor = response.css('.a-color-base ._cDEzb_p13n-sc-css-line-clamp-1_1Fn1y').css('::text').extract()
        productPrice = response.xpath('//*[@class="_cDEzb_p13n-sc-price_3mJ9Z"]//text()').getall()
        productImageLink = response.css('.p13n-product-image::attr(src)').getall()

        #items['productName'] = productName
        #items['productAuthor'] = productAuthor
        items['productPrice'] = productPrice
        #items['productImageLink'] = productImageLink

        yield items

Output:

{'productPrice': ['₹2,299.00', '₹3,499.00', '₹666.62', '₹1,149.00', '₹2,348.00', '₹499.00', '₹529.00', '₹4,499.00', '₹899.00', '₹3,569.00', '₹1,789.00', '₹199.00', '₹549.00', '₹1,088.00', '₹1,595.00', '₹95.00', '₹299.00', '₹299.00', '₹1,251.00', '₹399.00', '₹399.00', '₹199.00', '₹1,299.00', '₹490.00', '₹499.00']}
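The prices above are still plain strings. If numeric values are needed, for example to sort or compare them, a minimal sketch of the conversion could look like the following. The function name `parse_inr_price` is hypothetical, and the sketch assumes every string has the `₹1,234.00` shape seen in the output.

```python
from decimal import Decimal


def parse_inr_price(text: str) -> Decimal:
    """Convert a string such as '₹2,299.00' into a Decimal amount."""
    # Drop the rupee sign and the thousands separators before parsing.
    cleaned = text.replace("₹", "").replace(",", "").strip()
    return Decimal(cleaned)


# A few of the scraped strings from the output above.
scraped = ["₹2,299.00", "₹666.62", "₹95.00"]
prices = [parse_inr_price(p) for p in scraped]
print(max(prices))
```

`Decimal` is used rather than `float` so that the amounts keep their exact two-decimal values.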