How to remove suffix from scraped links?

Time:03-06

I'm looking for a solution to get full-size images from a website.

Using code I recently finished with someone's help on Stack Overflow, I can download the images, but I get a mix of full-size and downsized ones.

What I want is for all downloaded images to be full-sized.

For example, some image filenames end with a suffix such as "-625x417.jpg", and some don't.

https://www.bikeexif.com/1968-harley-davidson-shovelhead (has suffix)
https://www.bikeexif.com/harley-panhead-walt-siegl (no suffix)

If the suffix is removed, the URL points to the full-size image.

https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg (scraped)
https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg (full-size URL once -625x417 is removed)

Other resolutions may appear in the filenames as well, so suffixes with different sizes need to be removed too.

I guess I need a regular expression to strip '-<3 digits>x<3 digits>' from the URLs collected by the code below.

But I really don't have any idea how to do that.

If you can do that, please help me finish this. Thank you!

images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_article.xpath('//div[@id="content"]//img/@data-src').getall()

Full Code:

import requests
import parsel
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

for page in range(1, 310):
    print(f'======= Scraping data from page {page} =======')

    url = f'https://www.bikeexif.com/page/{page}'

    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)

    containers = selector.xpath('//div[@]/div/article[@]')

    for v in containers:

        old_title = v.xpath('.//div[2]/h2/a/text()').get()
        
        if old_title is None:
            continue  # skip containers without a title link
        title = old_title.replace(':', ' -').replace('?', '')

        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        print(title, title_url)

        os.makedirs( os.path.join('bikeexif', title), exist_ok=True )

        response_article = requests.get(url=title_url, headers=headers)
        selector_article = parsel.Selector(response_article.text)

        # Need to get full-size images only
        # (remove the size suffix if present, e.g. -625x417; other sizes must be removed too)
        images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                     selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
        print('len(images_url):', len(images_url))

        for img_url in images_url:

            response_image = requests.get(url=img_url, headers=headers)

            filename = img_url.split('/')[-1]
            
            with open( os.path.join('bikeexif', title, filename), 'wb') as f:
                f.write(response_image.content)
                print('Download complete!!:', filename)

CodePudding user response:

I would go with something like this:

import re

url = 'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg'

new_url = re.sub(r'(.*)-\d+x\d+(\.jpg)', r'\1\2', url)
# https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead.jpg

Explanation:

  • The regular expression has three parts. (.*) matches any sequence of characters of any length; the parentheses capture it as the first group.
  • -\d+x\d+ matches the dash, followed by one or more digits, followed by x, followed by one or more digits.
  • The last part, (\.jpg), matches the literal extension .jpg. The backslash is needed because . is a special character in regular expressions (it matches any single character), so escaping it says we mean a literal dot. The parentheses capture it as the second group.
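To see what each part captures, here is a minimal sketch that uses re.match on the same URL and prints the two groups. The variable names (match, prefix, extension, rebuilt) are only illustrative.

```python
import re

url = 'https://kickstart.bikeexif.com/wp-content/uploads/2018/01/1968-harley-davidson-shovelhead-625x417.jpg'

# Group 1 is everything before the size suffix, group 2 is the extension.
match = re.match(r'(.*)-\d+x\d+(\.jpg)', url)
prefix, extension = match.group(1), match.group(2)

print(prefix)     # the URL up to, but not including, "-625x417"
print(extension)  # ".jpg"

# Joining the two groups is what the replacement string performs.
rebuilt = prefix + extension
print(rebuilt)
```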

In the second argument to re.sub we have \1\2, which means "whatever was matched by the first set of parentheses in the pattern" followed by "whatever was matched by the second set of parentheses in the pattern".

Finally, the last argument is simply the string you want to process.
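To plug this into the scraper, one option is a small helper applied to every scraped URL. This is only a sketch: strip_size_suffix and full_size_urls are hypothetical names, the pattern is loosened to accept any file extension, and it assumes (as the question states for this site) that the full-size file really exists at the stripped URL. A filename that legitimately ends in something like "-2x4" would also be stripped.

```python
import re

# Matches a trailing "-<digits>x<digits>" right before the file extension.
SIZE_SUFFIX = re.compile(r'-\d+x\d+(\.\w+)$')


def strip_size_suffix(url):
    """Return url without a trailing -WIDTHxHEIGHT size suffix, if any."""
    return SIZE_SUFFIX.sub(r'\1', url)


def full_size_urls(urls):
    """Strip size suffixes and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(strip_size_suffix(u) for u in urls))
```

In the question's loop this could be called as images_url = full_size_urls(images_url) just before the for img_url in images_url: line, so each full-size image is downloaded once even when the page lists several sizes of it.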
