I am trying to scrape an image URL with Scrapy in Python. Instead of the img element, I want to scrape it from a source element that has no class attribute and no src attribute. Can anyone please help me with how to do this? Thanks in advance.
<source media="(min-width: 1024px)" sizes="1140px" srcset="https://static1.simpleflyingimages.com/wordpress/wp-content/uploads/2022/09/Thomas-Boon-Air-Canada-2.jpg?q=50&fit=contain&w=1140&h=&dpr=1.5">
The code I tried for this:
from urllib.parse import urljoin
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from datetime import datetime
import pandas as pd


class NewsSpider(scrapy.Spider):
    name = "simpleflying"

    def start_requests(self):
        url = input("Enter the article url: ")
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        Feature_Image = [i.strip() for i in response.css('source media="(min-width: 1024px)" ::attr(data-origin-srcset)').getall()][0]
        yield {
            'Feature_Image': Feature_Image,
        }
This is the link to the site: https://simpleflying.com/best-airlines-travel-with-babies-young-children/
Answer:
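You can try the following example, which reads the image URL from the data-img-url attribute of the img element inside the picture element: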
import scrapy


class NewsSpider(scrapy.Spider):
    name = "articles"

    def start_requests(self):
        url = 'https://simpleflying.com/best-airlines-travel-with-babies-young-children/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # .get() returns the first match; this assumes the feature image is
        # the first figure/picture/img on the page.
        img_url = response.xpath('//figure/picture/img/@data-img-url').get()
        yield {
            'img_url': img_url
        }
Output:
{'img_url': 'https://static1.simpleflyingimages.com/wordpress/wp-content/uploads/2022/09/Thomas-Boon-Air-Canada-2.jpg'}