How to match a URL so that a captured group does not contain a dot

I'm trying to group and match parts of a URL with the following code:

import re
pattern = r'(http|https\:\/\/)([a-zA-Z0-9\-\.]+\.)([a-zA-Z]{2,3})'
re.search(pattern, 'https://www.university.edu/').groups()
# what I got is ('https://', 'www.university.', 'edu')
# but what I expect is ('https://', 'www.university', 'edu')

As shown above, the second group currently captures the characters plus a trailing `.`. How can I change my code so that there is no dot at the end of the second group?

Thank you!

CodePudding user response：

import re
# The literal dot is moved outside group 2, so it is matched but not captured.
pattern = r'(http|https:\/\/)([a-zA-Z0-9\-\.]+)\.([a-zA-Z]{2,3})'
print(re.search(pattern, 'https://www.university.edu/').groups())
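
One thing to watch for in both patterns above: the alternation `(http|https:\/\/)` reads as "either `http` or `https://`", so a plain `http://` URL does not match, `re.search` returns None, and `.groups()` raises an AttributeError. A minimal sketch of one way around this, assuming `https?://` is what was intended (the names `URL_PATTERN` and `split_url` are made up for illustration):

```python
import re

# Hypothetical names. `https?` makes the "s" optional so both schemes
# include "://", and the literal dot sits between groups 2 and 3.
URL_PATTERN = re.compile(r'(https?://)([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,3})')

def split_url(url):
    """Return (scheme, host, tld), or None when the URL does not match."""
    match = URL_PATTERN.search(url)
    return match.groups() if match else None

print(split_url('https://www.university.edu/'))  # ('https://', 'www.university', 'edu')
print(split_url('http://www.university.edu/'))   # ('http://', 'www.university', 'edu')
```

Returning None instead of calling `.groups()` directly is only a convenience here; raising an error on a non-match would work just as well.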

CodePudding user response：

You can use findall with the following regular expression, with the global (g), multiline (m) and case-insensitive (i) flags set (in Python, re.findall already returns every match, so only re.M and re.I are needed):

^https?:\/\/|[a-z\d.-]+(?=\.)|(?<=\.)[a-z]{2,3}(?=\/?$)

Regex demo <¯\_(ツ)_/¯> Python demo

Note that the last example at the regex demo link illustrates that this expression does not check the correctness of the string format. This is no doubt one of the reasons for @DeepSpace's comment on the question.
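
For reference, here is a rough Python sketch of the findall approach and of the caveat just mentioned. It assumes the character class is followed by a `+` quantifier and contains no space; the input without a scheme is a made-up example, not one taken from the demo.

```python
import re

# findall already returns every match, so only the multiline and
# case-insensitive flags are passed.
PATTERN = r'^https?://|[a-z\d.-]+(?=\.)|(?<=\.)[a-z]{2,3}(?=/?$)'
FLAGS = re.MULTILINE | re.IGNORECASE

well_formed = re.findall(PATTERN, 'https://www.university.edu/', FLAGS)
no_scheme = re.findall(PATTERN, 'www.university.edu', FLAGS)

print(well_formed)  # ['https://', 'www.university', 'edu']
print(no_scheme)    # ['www.university', 'edu']
```

The second call still returns pieces even though the scheme is missing, so the length of the result has to be checked separately if the input may be malformed.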

The expression can be broken down as follows (alternatively, hover the cursor over each element of the expression at the regex link to obtain an explanation of its function).

^            # match the start of a line
http         # match the literal 'http'
s?           # optionally match 's'
:\/\/        # match the literal '://'
|            # or
[a-z\d.-]+   # match one or more of the indicated characters
(?=\.)       # positive lookahead asserts that the previous match is
             # followed by a period
|            # or
(?<=\.)      # positive lookbehind asserts that the next match is
             # preceded by a period
[a-z]{2,3}   # match two or three letters
(?=\/?$)     # positive lookahead asserts that the previous match is
             # followed by '/' at the end of the line or
             # by the end of the line
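
In Python, a breakdown like this can be kept next to the expression itself by compiling with re.VERBOSE, which ignores whitespace in the pattern and treats `#` as the start of a comment. A sketch under the same assumptions as above (`+` after the character class; `URL_PARTS` is a hypothetical name):

```python
import re

URL_PARTS = re.compile(
    r"""
    ^https?://                  # scheme at the start of a line
    | [a-z\d.-]+(?=\.)          # host characters followed by a period
    | (?<=\.)[a-z]{2,3}(?=/?$)  # 2-3 letters after a period, at end of line
    """,
    re.VERBOSE | re.MULTILINE | re.IGNORECASE,
)

# With re.MULTILINE, each line of the input is handled on its own.
text = "https://www.university.edu/\nhttp://example.org"
parts = URL_PARTS.findall(text)
print(parts)
# ['https://', 'www.university', 'edu', 'http://', 'example', 'org']
```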