I am having difficulty building a regex that extracts a value from a URL. The value I need is the text between the last "/" and ".html". Please help.
Sample URL1 - https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html - The value I want to extract is bingo
Sample URL2 - www.example.com/we/b345g.html - The value I want to extract is b345g
I tried to build a regex and I was able to get "bingo.html" and "b345g.html" using [^\/]+$
but was not able to remove or skip ".html"
CodePudding user response:
Here you are:
\/([^\/]+?)(?>\..+)?$
Explanation:
\/
- literal character '/'
([^\/]+?)
- first group (captured): one or more characters that are not a '/', matched lazily (as few as possible)
[^\/]
- any character that is not a '/'
+?
- lazy "one or more" quantifier (expands only as far as needed)
(?>\..+)?
- second, optional group: '.' followed by one or more characters (like '.html' or '.exe' or '.png')
?>
- makes the group atomic (no backtracking into it); the group is not captured, so its content is excluded from the result
\.
- literal character '.'
.+
- one or more of any character (except line terminators)
?
- optionality (note that this one is outside the parentheses)
$
- end of the string
If you want also to exclude query strings you can expand it like this:
\/([^\/]+?)(?>\..+)?(?>\?.*)?$
If you also need to skip the protocol part of the URL you can use this:
(?<!\/)\/([^\/]+?)(?>\..+)?(?>\?.*)?$
Here (?<!\/)
is a negative lookbehind: it checks that there is no '/' immediately before the start of the match, so the match cannot begin at the second slash of "//".
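As a rough sketch of how that last pattern could be used from code, here is a Python version. Python is my choice, not something from the question, and I replaced the atomic groups (?>...) with plain non-capturing groups (?:...) because Python's re module only accepts atomic groups from version 3.11 onward; for the sample URLs the captured value should be the same. The names WITH_LOOKBEHIND, WITHOUT_LOOKBEHIND and last_segment are made up for illustration.

```python
import re

# Assumption: (?:...) stands in for the atomic groups (?>...) of the answer.
WITH_LOOKBEHIND = re.compile(r"(?<!\/)\/([^\/]+?)(?:\..+)?(?:\?.*)?$")
# Same pattern without the lookbehind, kept only for comparison.
WITHOUT_LOOKBEHIND = re.compile(r"\/([^\/]+?)(?:\..+)?(?:\?.*)?$")


def last_segment(url, pattern=WITH_LOOKBEHIND):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(url)
    return match.group(1) if match else None


samples = [
    "https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html",
    "www.example.com/we/b345g.html",
    "www.example.com/we/b345g.html?x=1",
]
for url in samples:
    print(url, "->", last_segment(url))

# Without the lookbehind, a URL with a protocol starts matching at the
# second '/' of '//', so group 1 is the host prefix rather than the file name.
print(last_segment(samples[0], WITHOUT_LOOKBEHIND))
```

The comparison at the end is there to show why the lookbehind matters once the input can contain "https://".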
CodePudding user response:
You are only matching with [^\/]+$
but not differentiating between the part before and after the dot.
To tell them apart, you could for example use a capture group to get the part after the last slash and before the first dot.
\S*\/([^\/\s.]+)\.[^\/\s]+$
\S*\/
Match optional non-whitespace chars up to the last occurrence of /
([^\/\s.]+)
Capture group 1: match 1+ times any char except /, a whitespace char or .
\.
Match a dot
[^\/\s]+
Match 1+ times any char except / or a whitespace char
$
End of string
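For what it's worth, a minimal sketch of using this pattern from Python (my choice of language, the question does not name one). The function name extract_name is hypothetical; the pattern itself is the one above, unchanged.

```python
import re

# Pattern from the answer: group 1 is the text after the last '/'
# and before the first '.' that follows it.
NAME_PATTERN = re.compile(r"\S*\/([^\/\s.]+)\.[^\/\s]+$")


def extract_name(url):
    """Return the file name without its extension, or None if there is no match."""
    match = NAME_PATTERN.search(url)
    return match.group(1) if match else None


for url in (
    "https://www.example.com/fgf/sdf/sdf/as/dwe/we/bingo.html",
    "www.example.com/we/b345g.html",
    "www.example.com/we/noextension",
):
    print(url, "->", extract_name(url))
```

Because the pattern requires a dot after the captured part, a last segment with no extension gives no match at all, so the caller has to handle None.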