#!/usr/bin/python3
import requests
import re


def get_html(url):
    r = requests.get(url)
    return r.text


url = "https://foxnews.com/"
html = get_html(url)
pattern = re.compile(r'\b(?:https?:\/\/)?(?:.*?\.)\b(?:png|jpg|jpeg|gif|svg)')
matches = re.findall(pattern, html)
for match in matches:
    print(match)
But it keeps matching things like this:
" data-src="//a57.foxnews.com/hp.foxnews.com/images/2022/05/320/180/adc987321731b78d41a3b061e67ea84c.png
href="https://www.foxnews.com/entertainment/saturday-night-live-medieval-supreme-courts-leaked-draft-opinion-abortion-roe-v-wade"
And so on. How can I make this match only the image links and filter everything else out? Would lookarounds help? I'm stuck here.
CodePudding user response:
You can use
pattern = re.compile(r'\b(?:https?://)?[^\s<>"\']*\.(?:png|jpg|jpeg|gif|svg)\b')
matches = re.findall(pattern, html)
Details:
\b - word boundary
(?:https?://)? - an optional sequence of http:// or https://
[^\s<>"\']* - zero or more chars other than whitespace, ", ', < and >
\. - a dot
(?:png|jpg|jpeg|gif|svg) - one of the extensions listed
\b - word boundary
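As a rough illustration of the breakdown above, here is a minimal sketch that runs the pattern against an inline HTML snippet instead of a live page. The sample markup (including the static.foxnews.com host) and the find_image_urls helper are made up for the demo. One thing it shows: because the pattern starts with \b, a protocol-relative link such as //a57.foxnews.com/... is returned without its leading //.

```python
import re

# Pattern from the answer above: optional scheme, then any run of characters
# that cannot leave an attribute value, then a dot and an image extension.
IMAGE_URL = re.compile(r'\b(?:https?://)?[^\s<>"\']*\.(?:png|jpg|jpeg|gif|svg)\b')


def find_image_urls(html):
    """Return every substring of html that looks like an image link (hypothetical helper)."""
    return IMAGE_URL.findall(html)


# Made-up markup: one protocol-relative image, one article link, one absolute image.
sample_html = (
    '<img data-src="//a57.foxnews.com/images/a.png"> '
    '<a href="https://www.foxnews.com/entertainment/story">x</a> '
    '<img src="https://static.foxnews.com/logo.svg">'
)

print(find_image_urls(sample_html))
# ['a57.foxnews.com/images/a.png', 'https://static.foxnews.com/logo.svg']
```

The article link is skipped because nothing inside its quotes ends in one of the listed extensions, and the character class cannot cross the closing quote.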
CodePudding user response:
I have collated a set of regular expressions that are available on patterns-finder. For finding URLs, I have been using the following regex:
((ftp|https?):\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|www\\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\\.[^\\s]{2,}|https?:\\/\\/(?:www\\.|(?!www))[a-zA-Z0-9]+\\.[^\\s]{2,}|www\\.[a-zA-Z0-9]+\\.[^\\s]{2,})
Note that it also accepts the ftp protocol. With a slight modification, you can append a list of file extensions, (gif|png|jpg|jpeg|webp|svg|psd|bmp|tif), which covers more of the common image types. The final regex looks like this:
\b((ftp|https?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})[^\s]*\.(gif|png|jpg|jpeg|webp|svg|psd|bmp|tif)\b
It will match the following:
https://cdn.analytics.com/wp-content/uploads/2022/04/test.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/file.jpg
http://cdn.analytics.com/wp-content/uploads/2022/04/.jpg
ftp://www.test.com/wp-content/uploads/2022/04/.jpg
It also handles the case where the file name is just ".jpg", ".png", etc. But it will not match the following:
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test.html
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test
http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/png
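To check the lists above, here is a small sketch in Python. It assumes the pattern uses + quantifiers after the character classes, as in the widely circulated form of this URL regex. The names IMAGE_URL, should_match and should_not_match are only illustrative. Since the pattern contains capturing groups, re.findall would return tuples of groups, so the sketch uses search and group(0) to get the full match.

```python
import re

# Final regex from this answer, split across lines for readability.
# Assumption: "+" quantifiers follow the character classes.
IMAGE_URL = re.compile(
    r'\b((ftp|https?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}'
    r'|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}'
    r'|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
    r'[^\s]*\.(gif|png|jpg|jpeg|webp|svg|psd|bmp|tif)\b'
)

should_match = [
    "https://cdn.analytics.com/wp-content/uploads/2022/04/test.jpg",
    "http://cdn.analytics.com/wp-content/uploads/2022/04/file.jpg",
    "http://cdn.analytics.com/wp-content/uploads/2022/04/.jpg",
    "ftp://www.test.com/wp-content/uploads/2022/04/.jpg",
]
should_not_match = [
    "http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test.html",
    "http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/test",
    "http://a57.foxnews.com/images/2022/05/320/180/1e67ea84c/png",
]

for url in should_match + should_not_match:
    m = IMAGE_URL.search(url)
    # group(0) is the whole matched URL, regardless of the inner groups.
    print(url, "->", m.group(0) if m else None)
```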