Python regex to get the closest match without duplicated content

Time:12-01

What I need

I have a list of img src links. Here is an example:

  • https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
  • https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
  • https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1

I need to get the following result:

studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg

studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg

studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png

Problem

I use the following regex:

studiocake\.kiev\.ua.*(jpeg|png|jpg)

But it doesn't work the way I need. Instead of the result I need, I get a link that starts at the first occurrence of the domain:

studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg 

Question

How can I get the result I need with a Python regex?

CodePudding user response:

You can let a greedy .* consume the first occurrence of the domain and capture only the last one.

import re

# s is the input text that contains the URLs
matches = re.findall(r"(?i).*\b(studiocake\.kiev\.ua\S*\b(?:jpeg|png|jpg))\b", s)

See this demo at regex101 (matches in group 1) or a Python demo at tio.run


Inside the capture group, \S* matches any number of non-whitespace characters.
I also added some \b word boundaries and the (?i) flag to ignore case.
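
As a rough, self-contained sketch of how this behaves, the snippet below runs the same pattern over the three sample URLs from the question. The variable names `s` and `pattern` are only illustrative, and the input is assumed to hold one URL per line (since . does not match a newline by default, the greedy .* stays within a single line).

```python
import re

# Sample input: the three URLs from the question, one per line (assumed layout).
s = """https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1"""

# The leading greedy .* swallows as much of the line as it can, so the
# capture group ends up starting at the LAST occurrence of the domain.
pattern = r"(?i).*\b(studiocake\.kiev\.ua\S*\b(?:jpeg|png|jpg))\b"

# With exactly one capture group, findall returns the group-1 strings.
matches = re.findall(pattern, s)

for m in matches:
    print(m)
```

Each printed line should be the inner image link without the scheme, which is the form the question asks for.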

CodePudding user response:

What you want to achieve is a standard operation on URLs, and Python has a good number of libraries for it. Instead of using regexes for this task, I would recommend a URL parsing library: it provides the standard operations and leads to clearer, more robust code.

from urllib.parse import urlparse, parse_qs


def extractSrc(strUrl):
  # Parse original URL using urllib
  parsed_url = urlparse(strUrl)

  # Find the value of the query parameter src
  src_value = parse_qs(parsed_url.query)['src'][0]
  
  # Again, using same library, parse img url which we got above.
  img_parsed_url = urlparse(src_value)

  # Remove the scheme in the img URL and return result.
  scheme = "%s://" % img_parsed_url.scheme
  return img_parsed_url.geturl().replace(scheme, '', 1)



urls = '''https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg&nocache=1
https://studiocake.kiev.ua/wp-content/webpc-passthru.php?src=https://studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png&nocache=1'''

for u in urls.split('\n'):
  print(extractSrc(u))

Output:

studiocake.kiev.ua/wp-content/uploads/photo_2020-12-27_12-18-00-2-333x444.jpg
studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg
studiocake.kiev.ua/wp-content/uploads/tri-shokolada.png

CodePudding user response:

My hack expression is this:

(https://)(studiocake\.kiev\.ua.*(php)\?src=https://)(studiocake\.kiev\.ua.*(jpeg|png|jpg))(&nocache=1)

Replace the whole match with group 4 (written $4 in many regex tools; in Python's re.sub the replacement string is r"\4").

Explanation:

I captured the whole link in parts and then replaced it with the one part that is needed.
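
In Python, that replacement could look roughly like the sketch below. It assumes each URL is processed on its own and always ends in &nocache=1, as in the question; the names `pattern`, `url`, and `result` are illustrative only.

```python
import re

# The expression from above, split across two string literals for readability.
# Counting opening parentheses left to right, group 4 is the inner image link.
pattern = re.compile(
    r"(https://)(studiocake\.kiev\.ua.*(php)\?src=https://)"
    r"(studiocake\.kiev\.ua.*(jpeg|png|jpg))(&nocache=1)"
)

url = (
    "https://studiocake.kiev.ua/wp-content/webpc-passthru.php"
    "?src=https://studiocake.kiev.ua/wp-content/uploads/IMG_4945-333x444.jpeg"
    "&nocache=1"
)

# Python's re.sub refers to groups as \4 (or \g<4>), not $4.
result = pattern.sub(r"\4", url)
print(result)
```

Because the pattern hard-codes the scheme, the php script, and the trailing &nocache=1, a URL that deviates from this shape is left unchanged by sub.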
