Regex search for ONLY domains, ignoring domain component of URL

Time:08-26

Given a block of arbitrary text, I need a regex pattern that finds and extracts bare domains only. It should ignore the scheme and subdomain components, and skip a string entirely if it has a path (those are already being extracted as URLs).

Example Text:

www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username

Expected matches:
reddit.com
stackoverflow.com

I have tried the following

\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b

Which of course will return:
www.google.com
www.stackoverflow.com
reddit.com
www.facebook.com

CodePudding user response:

You can use

\b(?!www\.)(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,63}\b(?![/.])


Details:

  • \b - a word boundary
  • (?!www\.) - no www. immediately on the right is allowed
  • (?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+ - one or more occurrences of
    • (?=[a-z0-9-]{1,63}\.) - a positive lookahead that requires 1 to 63 ASCII lowercase letters, digits or hyphens and then a . immediately to the right of the current location
    • (?:xn--)? - an optional xn-- char sequence
    • [a-z0-9]+ - one or more lowercase ASCII letters or digits
    • (?:-[a-z0-9]+)* - zero or more sequences of - and one or more lowercase ASCII letters or digits
    • \. - a . char
  • [a-z]{2,63} - 2 to 63 lowercase ASCII letters
  • \b - a word boundary
  • (?![/.]) - a negative lookahead that fails the match if there is a / or . immediately to the right of the current location.
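
As a minimal sketch, the pattern described above can be tried in Python against the sample text from the question. The names DOMAIN_RE and find_bare_domains are made up for this illustration, and the `+` quantifiers are written out explicitly:

```python
import re

# Pattern from the answer above. All groups are non-capturing,
# so findall() returns the full matched strings.
DOMAIN_RE = re.compile(
    r"\b(?!www\.)"                      # do not start at a leading "www."
    r"(?:(?=[a-z0-9-]{1,63}\.)(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.)+"
    r"[a-z]{2,63}\b"                    # top-level domain
    r"(?![/.])"                         # reject if a path or another dot follows
)

def find_bare_domains(text):
    """Return domains that are not followed by a path (hypothetical helper)."""
    return DOMAIN_RE.findall(text)

sample = """www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
"""

print(find_bare_domains(sample))  # ['stackoverflow.com', 'reddit.com']
```

One consequence of the final (?![/.]) lookahead is that a domain directly followed by a sentence-ending period, such as "reddit.com.", is skipped as well, so the input may need trimming if it comes from running prose.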

CodePudding user response:

import re
text = '''www.google.com/example
www.stackoverflow.com
https://reddit.com
https://www.facebook.com/username
'''


# Only finds .com domains preceded by "www." or a scheme ending in "tps:"
re.findall(r'(?<=(?:www\.|tps:))/*([a-z]+\.com)(?![/])', text)

['stackoverflow.com', 'reddit.com']