allowed_domains = ['www.google.com','google.com',]
start_urls = ['https://www.google.com/search?q=mobiles&tbm=pts&sxsrf=AJOqlzXrlIIii_GtGMCheGMJHKPpQl1hLw:1673692348905&source=hp&ei=vITCY_2YNOKVxc8P79uA2A8&iflsig=AK50M_UAAAAAY8KSzHAkD8f8N_ul8boy27FJhuidI9c7&ved=0ahUKEwj95qrv7cb8AhXiSvEDHe8tAPsQ4dUDCAg&uact=5&oq=mobiles&gs_lcp=Cg9nd3Mtd2l6LXBhdGVudHMQAzIECCMQJzIFCAAQkQIyBAgAEEMyCggAEIAEEIcCEBQyCAgAEIAEELEDMggIABCABBCxAzILCAAQgAQQsQMQgwEyCAgAEIAEELEDMggIABCABBCxAzILCAAQgAQQsQMQyQM6CAgAELEDEIMBOgUIABCABDoFCAAQsQM6BQgAEJIDUABYygxg1g1oAHAAeACAAfADiAG4DpIBAzQtNJgBAKABAQ&sclient=gws-wiz-patents']
These are the parse and other_link functions:
def parse(self, response):
    title = response.xpath("//div[@class='yuRUbf']/a/h3/text()").extract_first()
    realetd_data = response.xpath("//div[@class='yuRUbf']/a/@href").get()
    yield response.follow(url=realetd_data, callback=self.other_link)

def other_link(self, response):
    heading = response.xpath("//div[@class='abstract style-scope patent-text']/text()").get()
    yield {
        'heading': heading
    }
I am getting this output:
DEBUG: Crawled (200) <GET https://www.google.com/search?q=mobiles&tbm=pts&sxsrf=AJOqlzXrlIIii_GtGMCheGMJHKPpQl1hLw:1673692348905&source=hp&ei=vITCY_2YNOKVxc8P79uA2A8&iflsig=AK50M_UAAAAAY8KSzHAkD8f8N_ul8boy27FJhuidI9c7&ved=0ahUKEwj95qrv7cb8AhXiSvEDHe8tAPsQ4dUDCAg&uact=5&oq=mobiles&gs_lcp=Cg9nd3Mtd2l6LXBhdGVudHMQAzIECCMQJzIFCAAQkQIyBAgAEEMyCggAEIAEEIcCEBQyCAgAEIAEELEDMggIABCABBCxAzILCAAQgAQQsQMQgwEyCAgAEIAEELEDMggIABCABBCxAzILCAAQgAQQsQMQyQM6CAgAELEDEIMBOgUIABCABDoFCAAQsQM6BQgAEJIDUABYygxg1g1oAHAAeACAAfADiAG4DpIBAzQtNJgBAKABAQ&sclient=gws-wiz-patents> (referer: None)
2023-01-14 16:43:26 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.google.com.pk': <GET https://www.google.com.pk/patents/WO2006010333A1?cl=en&dq=mobiles&hl=en&sa=X&ved=2ahUKEwiCmP_c_cb8AhW-qZUCHW4ZABYQ6AF6BAgFEAM>
2023-01-14 16:43:26 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-14 16:43:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Can you please help me?
CodePudding user response:
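allowed_domains = ['www.google.com', 'google.com', 'google.com.pk']
This should work. The log shows the followed link points to www.google.com.pk, which is not in allowed_domains, so the offsite middleware filters the request and other_link is never called. Add google.com.pk to allowed_domains. Note that the entries must be bare domain names, not URLs, so leave out the https:// scheme.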
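To illustrate why the request was filtered, here is a rough sketch of the kind of host matching an offsite check performs. It is a standard-library approximation under my own assumptions, not Scrapy's actual code: is_allowed is a hypothetical helper, and the URL is a shortened form of the one in the log.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (an approximation of offsite filtering)."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


# Shortened version of the filtered URL from the log above.
url = "https://www.google.com.pk/patents/WO2006010333A1?cl=en"

# Original list: www.google.com.pk is neither google.com nor a subdomain of it.
print(is_allowed(url, ["www.google.com", "google.com"]))

# Updated list: www.google.com.pk is a subdomain of google.com.pk.
print(is_allowed(url, ["www.google.com", "google.com", "google.com.pk"]))
```

The first call prints False and the second prints True, which matches the "Filtered offsite request" message in the log: google.com.pk is a different domain from google.com, not a subdomain of it.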