Regex to pick a value from url

Time:03-25

I am having difficulty building a regex that extracts a value from a URL. The requirement is to get the value between the last "/" and ".html". Please help.

Sample URL1 - https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html - The value I want to extract is bingo

Sample URL2 - www.example.com/we/b345g.html - The value I want to extract is b345g

I tried to build a regex and was able to get "bingo.html" and "b345g.html" using [^\/]+$, but I was not able to remove or skip ".html".

CodePudding user response:

Here you are:

\/([^\/]+?)(?>\..+)?$

Explanation:

  • \/ - literal character '/'
  • ([^\/]+?) - first group: at least one character that is not a '/', matched lazily (as few characters as possible)
    • [^\/] - any character that is not a '/'
    • + - at least one occurrence
    • ? - laziness modifier (makes the preceding + match as few characters as possible)
  • (?>\..+)? - second, optional group: a '.' followed by at least one character (like '.html' or '.exe' or '.png')
    • ?> - atomic group, which is also non-capturing (its content is excluded from the captured result)
    • \. - literal character '.'
    • . - any character (except line terminators)
    • + - at least one occurrence
    • ? - optionality (note that this one is outside the parentheses)
  • $ - end of the string
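
As a rough check of the breakdown above, here is a minimal Python sketch. It makes two assumptions that are not part of the answer: Python's re module accepts atomic groups (?>...) only from version 3.11, so a plain non-capturing group (?:...) is swapped in (it matches the same way on these inputs), and the helper name first_capture is hypothetical.

```python
import re

# Assumption: (?:...) stands in for the atomic group (?>...) of the answer,
# because Python's re module supports atomic groups only from 3.11 onward.
PATTERN = re.compile(r"/([^/]+?)(?:\..+)?$")


def first_capture(url):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


# Scheme-less URLs, as in Sample URL2 of the question.
print(first_capture("www.example.com/we/b345g.html"))      # b345g
print(first_capture("www.example.com/we/archive.tar.gz"))  # archive
```

The lazy +? stops the first group at the first '.', so everything from that dot to the end of the string is consumed by the optional second group and stays out of the capture.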

If you also want to exclude query strings, you can extend it like this:

\/([^\/]+?)(?>\..+)?(?>\?.*)?$

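If the URL includes the protocol part (for example "https://"), the patterns above match from the second "/" of "//" and capture "www" for Sample URL1. To skip the protocol part, you can use this:

(?<!\/)\/([^\/]+?)(?>\..+)?(?>\?.*)?$

Here (?<!\/) is a negative lookbehind: it only checks that there is no '/' immediately before the start of the match.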
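
To see the lookbehind variant at work on both sample URLs, a small Python sketch follows. The same assumption applies as for any Python port of these patterns: (?:...) replaces the atomic groups (?>...), which Python's re module accepts only from 3.11. The helper name page_name and the query-string URLs are made up for illustration.

```python
import re

# Assumption: (?:...) replaces the atomic groups (?>...) of the answer.
# (?<!/) rejects a '/' that directly follows another '/', i.e. the second
# slash of "https://".
PATTERN = re.compile(r"(?<!/)/([^/]+?)(?:\..+)?(?:\?.*)?$")


def page_name(url):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


urls = [
    "https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html",  # Sample URL1
    "www.example.com/we/b345g.html",                             # Sample URL2
    "www.example.com/we/b345g.html?page=2",                      # hypothetical
    "www.example.com/we/b345g?page=2",                           # hypothetical
]
for url in urls:
    print(page_name(url))
# bingo
# b345g
# b345g
# b345g
```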

CodePudding user response:

You are only matching using [^\/]+$, which does not differentiate between the part before and the part after the dot.

To distinguish them, you could, for example, use a capture group to get the part after the last slash and before the first dot.

\S*\/([^\/\s.]+)\.[^\/\s]+$
  • \S*\/ Match optional non-whitespace chars up to the last occurrence of /
  • ([^\/\s.]+) Capture group 1: match 1+ times any char except a /, a whitespace char, or a .
  • \. Match a dot
  • [^\/\s]+ Match 1+ times any char except a / or a whitespace char
  • $ End of string
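
A minimal Python sketch of this pattern, run against the two sample URLs from the question. The helper name name_before_extension is hypothetical, and the escaped slashes are dropped because Python does not need them.

```python
import re

# Group 1 holds the text between the last '/' and the first '.' after it.
PATTERN = re.compile(r"\S*/([^/\s.]+)\.[^/\s]+$")


def name_before_extension(url):
    """Return capture group 1, or None when the last segment has no extension."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(name_before_extension(
    "https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(name_before_extension("www.example.com/we/b345g.html"))     # b345g
print(name_before_extension("www.example.com/we/b345g"))          # None
```

Because the greedy \S*\/ runs to the last slash before the capture group starts, the protocol part needs no special handling here. The pattern requires a dot after the captured name, so a URL without an extension does not match.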

