What I need
I have a list of img src links. Here is an example:
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1
I need to get the following result:
studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg
studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg
studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png
Problem
I use the following regex:
studiocake\.kiev\.ua.*(jpeg|png|jpg)
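But it doesn't work the way I need. Instead of the result above, I get links like this, because the match starts at the first occurrence of the domain rather than the one after src=: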
studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg
Question
How can I get the result I need with a Python regex?
CodePudding user response:
You can let a greedy .* consume the starting match and capture the latter.
import re
matches = re.findall(r"(?i).*\b(studiocake\.kiev\.ua\S*\b(?:jpeg|png|jpg))\b", s)
See this demo at regex101 (matches in group 1) or a Python demo at tio.run
Inside the group I used \S* to match any number of non-whitespace characters. I also added some \b word boundaries and the (?i) flag to ignore case.
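A minimal runnable sketch of the pattern above, assuming the links sit one per line in a single string (called s, as in the snippet; the sample data is copied from the question). Because . does not match a newline, the greedy .* works line by line and backtracks to the last occurrence of the domain on each line:

```python
import re

# Sample input taken from the question: one link per line.
s = """https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1"""

# The leading .* is greedy, so group 1 starts at the LAST
# "studiocake.kiev.ua" on each line, not the first one.
pattern = r"(?i).*\b(studiocake\.kiev\.ua\S*\b(?:jpeg|png|jpg))\b"

# With exactly one capture group, findall returns the group contents.
matches = re.findall(pattern, s)

for m in matches:
    print(m)
```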
CodePudding user response:
What you want to achieve is a standard operation on URLs, and Python has a good number of libraries for that. Instead of using regexes for this exercise, I would recommend a URL parsing library, which provides the standard operations and leads to cleaner code.
from urllib.parse import urlparse, parse_qs
def extractSrc(strUrl):
    # Parse the original URL using urllib
    parsed_url = urlparse(strUrl)
    # Find the value of the query parameter src
    src_value = parse_qs(parsed_url.query)['src'][0]
    # Using the same library again, parse the image URL obtained above
    img_parsed_url = urlparse(src_value)
    # Remove the scheme from the image URL and return the result
    scheme = "%s://" % img_parsed_url.scheme
    return img_parsed_url.geturl().replace(scheme, '', 1)
urls = '''https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1'''
for u in urls.split('\n'):
    print(extractSrc(u))
Output:
studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg
studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg
studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png
CodePudding user response:
My hack expression is this:
(https://)(studiocake\.kiev\.ua.*(php)\?src=https://)(studiocake\.kiev\.ua.*(jpeg|png|jpg))(&nocache=1)
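Then replace the match with $4 (in Python's re.sub the same backreference is written \4 or \g<4>).
Explanation...
I matched the whole link in parts and then replaced it with the one part that is needed. Group 4 is the second studiocake.kiev.ua... part, up to and including the file extension.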
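A rough sketch of how that replacement could look in Python, assuming the expression is used unchanged with re.sub. The helper name strip_wrapper is hypothetical and only there for illustration, and the sample links are the ones from the question:

```python
import re

# The expression from above, split over two lines for readability.
# Group 4 is the inner image URL without its scheme.
pattern = re.compile(
    r"(https://)(studiocake\.kiev\.ua.*(php)\?src=https://)"
    r"(studiocake\.kiev\.ua.*(jpeg|png|jpg))(&nocache=1)"
)

def strip_wrapper(link):
    # Hypothetical helper: replace the whole matched link with group 4.
    # Python spells the backreference \g<4> (or \4) rather than $4.
    return pattern.sub(r"\g<4>", link)

links = [
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1",
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1",
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1",
]

results = [strip_wrapper(link) for link in links]
for r in results:
    print(r)
```

Note that this expression is tied to the exact shape of these links: a link without the trailing &nocache=1 would not match and would be returned unchanged.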