I want to get the thread id from my URLs with a single pattern. The pattern should have just one group (at level 1). My test strings are:
https://www.mypage.com/thread-3306-page-32.html
https://www.mypage.com/thread-3306.html
https://www.mypage.com/Thread-String-Thread-Id
So I want a pattern that gives me the number 3306 for lines 1 and 2, and "String-Thread-Id" for the last line.
My current attempt is
.*[t|T]hread-(.*)[\-page.*|.html]
but it fails at the end, after the id. How can I do this properly? I also solved it with
.*Thread-(.*)|.*thread-(\\w+).*
but that uses two groups, which is not usable in my Java code.
CodePudding user response:
I don't know whether this fits every situation, but I would try this:
^.*?thread-((?:(?!-page|\.html).)*)
In Java, that could look something like
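import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// subjectString holds the URL(s) to scan, one per line
List<String> matchList = new ArrayList<>();
Pattern regex = Pattern.compile("^.*?thread-((?:(?!-page|\\.html).)*)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group(1));
}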
Explanation:
^ # Match start of line
.*? # Match any number of characters, as few as possible
thread- # until "thread-" is matched.
( # Then start a capturing group (number 1) to match:
(?: # (start of non-capturing group)
(?!-page|\.html) # assert that neither "-page" nor ".html" follows
. # then match any character
)* # repeat as often as possible
) # end of capturing group
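To check the pattern against the three test URLs from the question, a small self-contained sketch could look like the following. The class name ThreadIdDemo and the helper extractThreadId are made up for this illustration and are not part of the original code. Note that the pattern only spells "thread-" in lowercase, so matching "Thread-" in the third URL depends on the CASE_INSENSITIVE flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadIdDemo {
    // Same pattern as in the answer; CASE_INSENSITIVE lets "thread-" also match "Thread-".
    private static final Pattern THREAD_ID = Pattern.compile(
            "^.*?thread-((?:(?!-page|\\.html).)*)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the captured thread id,
    // or null when the URL contains no "thread-" at all.
    static String extractThreadId(String url) {
        Matcher m = THREAD_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.mypage.com/thread-3306-page-32.html",
            "https://www.mypage.com/thread-3306.html",
            "https://www.mypage.com/Thread-String-Thread-Id"
        };
        for (String url : urls) {
            // Prints 3306, 3306 and String-Thread-Id, one per line.
            System.out.println(extractThreadId(url));
        }
    }
}
```

Because the prefix .*? is lazy, the match starts at the first "Thread-" in the third URL, which is why the group contains the whole "String-Thread-Id" rather than just "Id".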