Python regular expressions matching question

#!/usr/bin/python3

import requests
import re

def get_html(url):
    r = requests.get(url)
    return r.text

url = "https://foxnews.com/"
html = get_html(url)

pattern = re.compile(r'\b(?:https?:\/\/)?(?:.*?\.)\b(?:png|jpg|jpeg|gif|svg)')
matches = re.findall(pattern, html)

for match in matches:
    print(match)

But it keeps matching things like this:

" data-src="//a57.foxnews.com/hp.foxnews.com/images/2022/05/320/180/adc987321731b78d41a3b061e67ea84c.png

href="https://www.foxnews.com/entertainment/saturday-night-live-medieval-supreme-courts-leaked-draft-opinion-abortion-roe-v-wade"

And so on. How can I make this match only image links and filter everything else out? Would lookarounds help? I'm stuck here.

CodePudding user response：

You can use

pattern = re.compile(r'\b(?:https?://)?[^\s<>"\']*\.(?:png|jpg|jpeg|gif|svg)\b')
matches = re.findall(pattern, html)

Details:

\b - word boundary
(?:https?://)? - an optional sequence of http:// or https://
[^\s<>"\']* - zero or more chars other than whitespace, ", ', < and >
\. - a dot
(?:png|jpg|jpeg|gif|svg) - one of the extensions listed
\b - word boundary
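
A minimal sketch of how this pattern behaves, run against a small hand-written HTML snippet instead of the live page. The snippet and the names IMAGE_URL and sample_html are made up for illustration:

```python
import re

# Pattern from the answer above, compiled once for reuse.
IMAGE_URL = re.compile(r'\b(?:https?://)?[^\s<>"\']*\.(?:png|jpg|jpeg|gif|svg)\b')

# Hypothetical HTML fragment: two image references and one article link.
sample_html = (
    '<img data-src="//a57.foxnews.com/images/2022/05/320/180/abc.png">'
    '<a href="https://www.foxnews.com/entertainment/some-article">link</a>'
    '<img src="https://static.foxnews.com/logo.svg">'
)

# The pattern has no capturing groups, so findall returns the full matches.
matches = IMAGE_URL.findall(sample_html)

for match in matches:
    print(match)
# a57.foxnews.com/images/2022/05/320/180/abc.png
# https://static.foxnews.com/logo.svg
```

The article link in href is skipped because it does not end in one of the listed extensions. For the protocol-relative data-src value the match starts at a57, since \b cannot match before the leading //, so those two slashes would have to be added back if a full URL is needed.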

CodePudding user response：

I have collected a set of regular expressions that are available in patterns-finder. For finding URLs, I have been using the following regex:

((ftp|https?):\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|www\\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|https?:\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9]+\\.[^\\s]{2,}|www\\.[a-zA-Z0-9]+\\.[^\\s]{2,})

Note that it also covers the ftp protocol. With a slight modification, you can append a list of file extensions, (gif|png|jpg|jpeg|webp|svg|psd|bmp|tif), which covers more of the common image types. The final regex looks like this:

\b((ftp|https?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})[^\s]*\.(gif|png|jpg|jpeg|webp|svg|psd|bmp|tif)\b

It will match the following:

https://cdn.analytics.com/wp-content/uploads/2022/04/test.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/file.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/.jpg
ftp://www.test.com/wp-content/uploads/2022/04/.jpg

It also handles the case where the file name is just ".jpg" or ".png", etc. But it will not match the following:

http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test.html
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/png
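
A rough sketch of how this regex could be used from Python, checked against some of the sample URLs listed above. The `+` quantifiers after the character classes are taken to be the intended form of the pattern, and the names IMAGE_URL, text and urls are only illustrative. Because the pattern contains capturing groups, re.findall would return tuples of groups, so the sketch uses finditer and group(0) to get the whole URL:

```python
import re

# Regex from the answer above, split over several lines for readability.
# Assumption: each character class that repeats is followed by "+".
IMAGE_URL = re.compile(
    r'\b((ftp|https?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
    r'[^\s]*\.(gif|png|jpg|jpeg|webp|svg|psd|bmp|tif)\b'
)

# Sample URLs taken from the lists above, one per line.
text = """
https://cdn.analytics.com/wp-content/uploads/2022/04/test.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/.jpg
ftp://www.test.com/wp-content/uploads/2022/04/.jpg
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test.html
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/png
"""

# group(0) is the entire match, regardless of the capturing groups inside.
urls = [m.group(0) for m in IMAGE_URL.finditer(text)]

for url in urls:
    print(url)
```

Only the first three URLs are printed. The last two are skipped because they do not end in a dot followed by one of the listed extensions.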