The code and output are below. I have searched for this issue, but the only relevant links I could find are these:
https://stackoverflow.com/questions/70444479/scrapy-not-able-to-scrape-ratings-data-on-amazon
https://www.quora.com/Why-cant-Amazon-prices-be-scraped
Any help is appreciated, thank you.
Spider Code
import json
import scrapy
from ..items import AmazontutorialItem


class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonSpider'
    allowed_domains = ['amazon.in']
    start_urls = [
        'https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031'
    ]

    def parse(self, response):
        items = AmazontutorialItem()
        productName = response.xpath("//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'a-link-normal', ' ' ))]//span//div/text()").extract()
        productAuthor = response.css('.a-color-base ._cDEzb_p13n-sc-css-line-clamp-1_1Fn1y').css('::text').extract()
        productPrice = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_cDEzb_p13n-sc-price_3mJ9Z", " " ))]/text()').extract()
        productImageLink = response.css('.p13n-product-image::attr(src)').extract()
        items['productName'] = productName
        items['productAuthor'] = productAuthor
        items['productPrice'] = productPrice
        items['productImageLink'] = productImageLink
        yield items
Terminal Output
2022-08-31 14:02:16 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: amazonTutorial)
2022-08-31 14:02:16 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.22000-SP0
2022-08-31 14:02:16 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'amazonTutorial',
'NEWSPIDER_MODULE': 'amazonTutorial.spiders',
'SPIDER_MODULES': ['amazonTutorial.spiders'],
'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; '
' http://www.google.com/bot.html)'}
2022-08-31 14:02:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-31 14:02:16 [scrapy.extensions.telnet] INFO: Telnet Password: 6adf5086c628832a
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-31 14:02:17 [scrapy.core.engine] INFO: Spider opened
2022-08-31 14:02:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-31 14:02:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-31 14:02:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031> (referer: None)
2022-08-31 14:02:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031>
{'productAuthor': ['ACTIVISION',
'Electronic Arts',
'Generic',
'Electronic Arts',
'ACTIVISION',
'ROCKSTAR GAMES',
'REES52',
'ROCKSTAR GAMES',
'Microsoft',
'ACTIVISION',
'Generic',
'Generic',
'Square Enix',
'UBI Soft',
'Rockstar North',
'Paradox Interactive',
'Generic',
'Generic',
'ADGAMES',
'Blizzard Entertainment',
'UBI Soft',
'Generic',
'Valve',
'Generic',
'Excalibur by Unlimited',
'UBI Soft',
'SEGA',
'ADGAMES',
'ACTIVISION',
'Generic',
'AD Games',
'Bluehole Studio Inc., PUBG Corporation',
'UBI Soft',
'Bethesda',
'Eidos',
'AG Gaming',
'ACTIVISION',
'ACTIVISION',
'ROCKSTAR GAMES',
'Generic',
'Generic',
'ADGAMES',
'Warner Bros.',
'Generic',
'UBI Soft',
'2K GAMES',
'Bethesda',
'Codemasters'],
'productImageLink': ['https://images-eu.ssl-images-amazon.com/images/I/91Wjtmyrg9L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81IXtVuvlmL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/714zMHvejkL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81yegjdGUjL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51KuZ6TnmfL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71mNkKmd3JL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/41SI-1pARKL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51BOMq 7w7L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51djUfKMJyL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81yhTa3zjlL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/819Nhgz 3NL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/61DSfTeIAdS._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81r70W2EVRL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/516v7vChU8L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81PR8qtHJJL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51VV5Z8M5KL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/91wL7h6OX6L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/31lL8a0n17L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51T43LR-tlL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81kTM28TXpL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/A1iEiu4PEJL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51ewfEKk2vL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/510G-36LZWL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81L8-mjNlrL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/61bDL5UUuFL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71nrt t8bAL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51eu5RaeIvL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81beRvbvv1L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51rlW7AK2xL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51kYpa4lksL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51tgtEXNi9L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/811qcvGij2L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/91vFfJh2IbL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71H7c4DPQEL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81qTUih-eUL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/41afQxgahrL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81ViUDBvP L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51gZP1Yh1nL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/61y3yx53X2L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51GhmLgOLPL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/615H16JHVSL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51kXWEFYqPL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/41zuts4V8EL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81Xa2jR5ApL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/813H5aEHrdL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/31WujBtNSbL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71LZ7amiLOL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71S9QD541VL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51SUf67G2QL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81LXTC16qBL._AC_UL300_SR300,200_.jpg'],
'productName': ['ACTIVISION Call of Duty: WWII (PS4)',
'FIFA 23 for PC',
'GTA (Combo) PC Game - Digital Download (No Online '
'Multiplayer/No REDEEM* Code) - | NO DVD NO CD |',
'FIFA 22 (PC)',
'Spider-Man: Friend or Foe (PC DVD)',
'PC GTA 4 (PC)',
'Waveshare Handheld Game HAT for Raspberry Pi '
'4B/3B /3B/2B/B /A /Zero/Zero W Portable Game Console Gameboy '
'3.5inch IPS Screen on Board Gamepad Joystick, Smoothly '
'Display',
'Grand Theft Auto: Episodes From Liberty City (PC)',
'Age of Empires IV: Standard - Windows 10 (Digital Code) (PC)',
'Call of Duty: WWII (Xbox One)',
'Project IGI (2000) Offline PC Game',
'G-T-A-San-Andrea - (Digital Download) Full PC Game - (NO DVD '
'NO CD) - (NO ONLINE MULTIPLAYER MODE) - PC.',
'Sleeping Dogs - Definitive Edition',
"Tom Clancy's Rainbow Six Siege (PC)",
'Grand Theft Auto V - PC - (ROCKSTAR SOCIAL CLUB DOWNLOAD '
'CODE-NO CD/DVD)',
'Take Command: Second Manassas',
'Trick (pc)',
'Hitman 2: Silent Assassin (2002) Offline PC Game',
'EPC Games: AOE (1,2 & 3) (Digital Download) No DVD/CD (No '
'Online Multiplayer) - Single Player Mode (PC Game)',
'Assassin Creed III Pc Game DVD For Windows Full Setup '
'Offline',
'World of Warcraft (PC/Mac)',
"Assassin's Creed IV: Black Flag (PS4)",
'GTA-San Andraes (PC GAME) - PC Download - [No Multiplayer/No '
'Redeem* Code] - | *NO DVD NO CD* | - WIN 10/11',
'Counter-strike: Global Offensive (PC)',
'Empire Earth (2001) Offline PC Game',
'Train Simulator 2015 (PC)',
"Tom Clancy's: Rainbow Six Siege (Free PS5 Upgrade)",
'Total War : Three Kingdoms Royal Edition (PC)',
'Total_OverDose Pc Game Dvd (Windows)',
'Call of Duty: Black Ops II (PS3)',
'Generic SpideMan 3 PC GAME- (Digital Download) - [ NO DVD NO '
'CD - NO ONLINE MULTIPLAYER/NO ACTIVATION CODE* ] - PC',
'JUST_CUSE_2 PC GAME DVD',
"Player Unknown's Battle Grounds -PUBG (Code in the Box)",
'Driver 3/Driver: Parallel Lines (PC)',
'Fallout 3 - Game of the Year Edition (PC Code)',
'SEKEIRO: SHADOWS DIE TWICE – GOTY EDITION (PC GAME) - '
'Digital Download (No Online Multiplayer/No REDEEM* Code) - | '
'NO DVD NO CD |',
'Kane and Lynch 2: Dog Days (PC DVD)',
'GA Retails - Battlefield Hardline Action Adventure Standard '
'Edition Offline PC Game (for PC)',
'ACTIVISION Call of Duty: WWII (PS4) UBI Soft Far Cry 5 (PS4)',
'Activision Blizzard Inc Call of Duty WWII - PlayStation 4 '
'Standard Edition',
'Grand Theft Auto: Vice City (PC)',
'Commandos: Behind Enemy Lines (1998) Offline PC Game',
'Command & Conquer: Generals – Zero Hour (2003) Offline PC '
'Game',
'Assassin Creed 4 Black Flag Pc Game DVD For Windows',
'Fear 2: Project Origin (PC)',
'Delta Force 2 (1999) Offline PC Game',
"Tom Clancy's: The Division (PS4)",
'The Bureau Xcom Declassified (PC)',
'Prey - PlayStation 4',
'F1: 2013 (PC)'],
'productPrice': []}
2022-08-31 14:02:19 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-31 14:02:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 325,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 53976,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.822063,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 31, 8, 32, 19, 119698),
'httpcompression/response_bytes': 315618,
'httpcompression/response_count': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 8, 31, 8, 32, 17, 297635)}
2022-08-31 14:02:19 [scrapy.core.engine] INFO: Spider closed (finished)
CodePudding user response:
Your XPath expression appears to be correct; the problem seems to come from the cookies and the user agent. If you send a browser user-agent and cookies as request headers, it should work.
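As a side note, scrapy.Request also accepts a cookies argument that takes a dict, which can be easier to maintain than one long header string. The following is a minimal sketch, assuming you start from a raw cookie header value; cookie_header_to_dict is a hypothetical helper name, and the cookie values shown are placeholders, not real session data.

```python
def cookie_header_to_dict(header: str) -> dict[str, str]:
    """Split a raw Cookie header value into a name -> value mapping."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, since values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Placeholder values for illustration only.
raw = "session-id=000-0000000-0000000; i18n-prefs=INR"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict could then be passed as scrapy.Request(url, cookies=cookies, headers={'user-agent': ...}, callback=self.parse).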
Example:
import scrapy
#from ..items import AmazontutorialItem


class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonSpider'

    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031'
        # Send cookies and a browser user-agent with the request
        headers = {
            'cookie': 'session-id=261-0423817-4284533; i18n-prefs=INR; ubid-acbin=260-9536726-4137347; csm-hit=tb:A66BGQMC56131P7KZ9YG s-A66BGQMC56131P7KZ9YG|1661938038725&t:1661938038725&adb:adblk_yes; session-token=cjJGeI4cpAcgaeDeO1s0KCC/G7QgNmdopz0rJ36VnXSj0STq5jsO91q2WNtml5LSD5wDG9wcSlfvPhI6WODNbkLHB 6 SQuH5S9tmWavNapCmLU2AG3Hgiw1Wddq9cbv0dXRTFgyXEEo02ivmXUTNvs5PSNQTRVOhGdVy1Z6gjfmJhfxbY WWNLeuCwfTU3DqW4pw cZwUS97Q0CU2axkSVijOpXybC/; session-id-time=2082758401l',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
        }
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse
        )

    def parse(self, response):
        items = {}  #AmazontutorialItem()
        productName = response.xpath("//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'a-link-normal', ' ' ))]//span//div/text()").extract()
        productAuthor = response.css('.a-color-base ._cDEzb_p13n-sc-css-line-clamp-1_1Fn1y').css('::text').extract()
        # The attribute test in this XPath was lost in the post ('//*[@]//text()');
        # the class name below is an assumption taken from the question's selector.
        productPrice = response.xpath('//*[contains(@class, "_cDEzb_p13n-sc-price_3mJ9Z")]//text()').getall()
        productImageLink = response.css('.p13n-product-image::attr(src)').getall()
        #items['productName'] = productName
        #items['productAuthor'] = productAuthor
        items['productPrice'] = productPrice
        #items['productImageLink'] = productImageLink
        yield items
Output:
{'productPrice': ['₹2,299.00', '₹3,499.00', '₹666.62', '₹1,149.00', '₹2,348.00', '₹499.00', '₹529.00', '₹4,499.00', '₹899.00', '₹3,569.00', '₹1,789.00', '₹199.00', '₹549.00', '₹1,088.00', '₹1,595.00', '₹95.00', '₹299.00', '₹299.00', '₹1,251.00', '₹399.00', '₹399.00', '₹199.00', '₹1,299.00', '₹490.00', '₹499.00']}
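The prices come back as display strings with a currency symbol and thousands separators. If you need numeric values, here is a minimal sketch of one way to convert them, assuming every string has the same shape as the output above; parse_price is a hypothetical helper name and is not part of Scrapy.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Convert a display string such as '₹2,299.00' into a Decimal."""
    # Drop surrounding whitespace, the rupee sign, and thousands separators.
    cleaned = text.strip().lstrip("₹").replace(",", "")
    return Decimal(cleaned)


# Sample values taken from the scraped output above.
prices = [parse_price(p) for p in ["₹2,299.00", "₹666.62", "₹95.00"]]
print(prices)
```

Decimal is used here rather than float so that amounts such as 666.62 are stored exactly.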