I am trying to construct a regular expression that finds all image URLs in a string. An image URL can be either an absolute or a relative path.
All these should be valid matches:
../example/test.png
https://www.test.com/abc.jpg
images/test.webp
For example: if we define
inputString="img src=https://www.test.com/abc.jpg background:../example/test.png <div> images/test.webp image.pnghello"
then we should find these 3 matches:
https://www.test.com/abc.jpg
../example/test.png
images/test.webp
I am currently doing this (I am using Python), but it only finds absolute paths, finds only some of the images, and sometimes produces bad matches (it finds a string that contains an image URL but appends a lot of the text that follows the URL):
imageurls = re.findall(r'(?:"|\')((?:https?://|/)\S+\.(?:jpg|png|gif|jpeg|webp))(?:"|\')', inputString)
CodePudding user response:
You can try:
(?i)https?\S+(?:jpg|png|webp)\b|[^:<>\s\'\"]+(?:jpg|png|webp)\b
import re
s = '''img src=https://www.test.com/abc.jpg background:../example/test.png <div> images/test.webp image.pnghellobackground-image: url('../images/pics/mobile/img.JPG')'''
pat = re.compile(r'(?i)https?\S+(?:jpg|png|webp)\b|[^:<>\s\'\"]+(?:jpg|png|webp)\b')
for m in pat.findall(s):
    print(m)
Prints:
https://www.test.com/abc.jpg
../example/test.png
images/test.webp
../images/pics/mobile/img.JPG
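For readability, the same two alternatives can be written in verbose mode with a comment on each. This is only a sketch: IMAGE_URL and find_image_urls are names made up here, and it assumes a `+` quantifier after `\S` and after the character class.

```python
import re

# The two alternatives of the pattern, one per line (re.VERBOSE ignores
# the layout whitespace and the trailing comments).
IMAGE_URL = re.compile(r'''
    https?\S+(?:jpg|png|webp)\b        # absolute URL: http or https, then non-space characters up to the extension
    |
    [^:<>\s\'\"]+(?:jpg|png|webp)\b    # relative path: a run without colon, angle bracket, whitespace or quote
''', re.IGNORECASE | re.VERBOSE)


def find_image_urls(text):
    """Return every image URL found in text (hypothetical helper name)."""
    return IMAGE_URL.findall(text)


print(find_image_urls("background:../example/test.png <div> images/test.webp image.pnghello"))
# prints ['../example/test.png', 'images/test.webp']
```

The trailing `\b` is what rejects image.pnghello: there is no word boundary between `png` and `hello`.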
CodePudding user response:
What do you think of this:
re.findall(r'(?=:[^\S])?(?:https?://)?[\./]*[\w/\.]+\.(?:jpg|png|gif|jpeg|webp)', inputString)
With:
"img src=http://another.org/hola.gif https://www.test.com/abc.jpg background:../example/test.png <div> images/test.webp image.pnghello"
Gives:
['http://another.org/hola.gif',
'https://www.test.com/abc.jpg',
'../example/test.png',
'images/test.webp',
'image.png']
This probably needs more test samples :)
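One way to collect such samples is a small table of inputs and the matches you want, plus a loop that reports where the pattern disagrees. The sketch below is an assumption-laden example, not a finished test suite: SAMPLES and failing_samples are hypothetical names, the pattern is the one above with `+` assumed after the character class, and the first sample is the input and expected result from the question.

```python
import re

# Pattern from this answer (the "+" quantifier is an assumption).
PATTERN = re.compile(r'(?=:[^\S])?(?:https?://)?[\./]*[\w/\.]+\.(?:jpg|png|gif|jpeg|webp)')

# Hypothetical sample table: input string -> list of matches we want.
SAMPLES = {
    "img src=https://www.test.com/abc.jpg background:../example/test.png <div> images/test.webp image.pnghello": [
        "https://www.test.com/abc.jpg",
        "../example/test.png",
        "images/test.webp",
    ],
    "<div> no images here </div>": [],
}


def failing_samples(pattern, samples):
    """Return (text, expected, actual) for every sample the pattern gets wrong."""
    failures = []
    for text, expected in samples.items():
        actual = pattern.findall(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


for text, expected, actual in failing_samples(PATTERN, SAMPLES):
    print("input:   ", text)
    print("expected:", expected)
    print("actual:  ", actual)
```

With these two samples, only the first is reported, because the pattern also returns image.png from image.pnghello, which the question does not want as a match.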