Given a block of arbitrary text, I need a regex pattern that finds/extracts domains only, ignoring the scheme and subdomain components, and skipping a string entirely if it has a path (those are already being extracted as URLs).
Example Text:
www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
Matches:
reddit.com
stackoverflow.com
I have tried the following:
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Which of course will return:
www.google.com
www.stackoverflow.com
reddit.com
www.facebook.com
CodePudding user response:
You can use
\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,63}\b(?![/.])
Details:
\b - a word boundary
(?!www\.) - no www. immediately on the right is allowed
(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+ - one or more occurrences of
    (?=[a-z0-9-]{1,63}\.) - a positive lookahead that requires 1 to 63 ASCII lowercase letters, digits or hyphens and then a . immediately to the right of the current location
    (?:xn--)? - an optional xn-- char sequence
    [a-z0-9]+ - one or more lowercase ASCII letters or digits
    (?:-[a-z0-9]+)* - zero or more sequences of - and one or more lowercase ASCII letters or digits
    \. - a . char
[a-z]{2,63} - 2 to 63 lowercase ASCII letters
\b - a word boundary
(?![/.]) - a negative lookahead that fails the match if there is a / or . immediately to the right of the current location.
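As a quick check, here is a minimal Python sketch that applies the pattern from this answer (with its + quantifiers) to the example text from the question. The names DOMAIN_RE and extract_domains are made up for illustration, and the pattern assumes lowercase input, as in the question.

```python
import re

# Pattern from this answer, split across two string literals for readability.
DOMAIN_RE = re.compile(
    r"\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+"
    r"[a-z]{2,63}\b(?![/.])"
)


def extract_domains(text):
    """Return every full match of DOMAIN_RE in text, in order of appearance."""
    # finditer + group(0) yields whole matches, independent of any grouping.
    return [m.group(0) for m in DOMAIN_RE.finditer(text)]


sample = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""

print(extract_domains(sample))  # ['stackoverflow.com', 'reddit.com']
```

Note that only a leading www. is skipped: an input such as mail.example.com is still returned whole, so other subdomains would need extra handling.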
CodePudding user response:
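import re

text = '''www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
'''

# Capture a .com name that follows "www." or "https:" (plus any slashes)
# and is not followed by a path. Both lookbehind alternatives are 4 characters
# long, because Python's re module only supports fixed-width lookbehind.
print(re.findall(r'(?<=(?:www\.|tps:))[/]*([a-z]+\.com)(?![/])', text))
# ['stackoverflow.com', 'reddit.com']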