Regex to pick a value from a URL

I am having difficulty building a regex that can extract a value from a URL. The condition is to get the value between the last "/" and ".html". Please help.

Sample URL1 - https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html - The value I want to extract is bingo

Sample URL2 - www.example.com/we/b345g.html - The value I want to extract is b345g

I tried to build a regex and was able to get "bingo.html" and "b345g.html" using [^\/]+$, but I was not able to remove or skip ".html".

CodePudding user response：

Here you are:

\/([^\/]+?)(?>\..+)?$

Explanation:

\/ - literal character '/'
([^\/]+?) - first group: at least one character that is not a '/', matched lazily (as few characters as possible)
- [^\/] - any character that is not a '/'
- + - at least one occurrence
- ? - lazy modifier (makes the preceding + match as few characters as possible)
(?>\..+)? - second, optional group: '.' followed by any characters (like '.html' or '.exe' or '.png')
- ?> - atomic group (non-capturing, so its content is excluded from the result; not supported by every regex engine)
- \. - literal character '.'
- . - any character (except line terminators)
- + - at least one occurrence
- ? - optionality (note that this one is outside the parentheses)
$ - end of the string

If you also want to exclude query strings, you can expand it like this:

\/([^\/]+?)(?>\..+)?(?>\?.*)?$

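If the URL includes the protocol part (as in Sample URL1), the patterns above can start matching at the second '/' of "//" and capture part of the host name instead, so in that case use this:

(?<!\/)\/([^\/]+?)(?>\..+)?(?>\?.*)?$

Here (?<!\/) is a negative lookbehind: it only checks that there is no '/' immediately before the start of the match.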
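
For anyone who wants to try this quickly, here is a minimal Python sketch of the last pattern. It is only an illustration under a few assumptions: the helper name last_segment is made up, and the atomic groups (?>...) are swapped for plain non-capturing groups (?:...), because Python's re module only accepts atomic groups from version 3.11 on. For the sample URLs the result should be the same.

```python
import re

# Assumption: (?:...) replaces the atomic groups (?>...) from the answer so the
# pattern also compiles on Python versions older than 3.11.
LAST_SEGMENT = re.compile(r"(?<!/)/([^/]+?)(?:\..+)?(?:\?.*)?$")


def last_segment(url):
    """Return the last path segment without extension or query string, or None."""
    match = LAST_SEGMENT.search(url)
    return match.group(1) if match else None


print(last_segment("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(last_segment("www.example.com/we/b345g.html"))  # b345g
```

Inside a Python raw string the '/' does not need escaping, which is why the \/ from the answer is written as a plain / here.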

CodePudding user response：

You are only matching using [^\/]+$, which does not differentiate between the part before and the part after the dot.

To make that distinction, you could for example use a capture group to get the part after the last slash and before the first dot.

\S*\/([^\/\s.]+)\.[^\/\s]+$

\S*\/ Match optional non-whitespace chars up to the last occurrence of /
([^\/\s.]+) Capture group 1: match 1+ times any char except /, a whitespace char, or .
\. Match a dot
[^\/\s]+ Match 1+ times any char except / or a whitespace char
$ End of string
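
As a rough check of the pattern above, here is a small Python sketch. The function name name_before_first_dot and the extra test URLs are assumptions made for illustration; only the two sample URLs come from the question.

```python
import re

# Same pattern as above; '/' needs no escaping inside a Python raw string.
PATTERN = re.compile(r"\S*/([^/\s.]+)\.[^/\s]+$")


def name_before_first_dot(url):
    """Return capture group 1 (text after the last '/' and before the first '.'), or None."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


print(name_before_first_dot("https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html"))  # bingo
print(name_before_first_dot("www.example.com/we/b345g.html"))  # b345g
print(name_before_first_dot("www.example.com/we/readme"))  # None, there is no dot after the last '/'
```

Note that this pattern requires a dot after the last slash, so a URL whose last segment has no extension gives no match at all rather than the bare segment.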

See a regex demo.