Regex: Use \b (word boundary) separator but ignore some characters

Given this example:

s = "Hi, domain: (foo.bar.com) bye"

I'd like to create a regex that matches both word and non-word strings, separately, i.e.:

re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo.bar.com", ") ", "bye"]

My approach was to use the word boundary separator \b to catch any string that is bound by two word-to-non-word switches. From the re module docs:

\b is defined as the boundary between a \w and a \W character (or vice versa)

Therefore I tried as a first step:

regex = r'(?:^|\b).*?(?=\b|$)'
re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo", ".", "bar", ".", "com", ") ", "bye"]

The problem is that I don't want the dot (.) character to be a separator too. I'd like the regex to see foo.bar.com as a whole word, not as three words separated by dots.
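
To make the problem concrete, here is a minimal sketch that lists the positions where \b matches inside the domain name. The variable names are chosen only for illustration. Because the dot is a \W character, a boundary sits on each side of every dot:

```python
import re

word = "foo.bar.com"

# \b is zero-width, so each match is an empty string; only its position matters.
positions = [m.start() for m in re.finditer(r'\b', word)]
print(positions)  # [0, 3, 4, 7, 8, 11]
```

Positions 3/4 and 7/8 are the boundaries around the two dots, which is why the domain is cut into pieces.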

I tried to find a way to use a negative lookahead on dot but did not manage to make it work.

Is there any way to achieve that?

I don't mind that the dot won't be a separator at all in the regex, it doesn't have to be specific to domain names.

I looked at Regex word boundary alternative, Capture using word boundaries without stopping at "dot" and/or other characters, and Regex word boundary excluding the hyphen, but they do not fit my case, as I cannot use the space as a separator condition.

Exclude some characters from word boundary is the only one that got me close, but I couldn't get it to work for my case.

CodePudding user response：

For your example, you could just split on [^\w.]+, using a capturing group around it to keep those values in the output:

import re

s = "Hi, domain: (foo.bar.com) bye"
re.split(r'([^\w.]+)', s)
# ['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']

If your string might start or end with non-word/space characters, you can filter out the resulting empty strings with a list comprehension:

s = "!! Hello foo.bar.com, your domain ##"
re.split(r'([^\w.]+)', s)
# ['', '!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##', '']
[w for w in re.split(r'([^\w.]+)', s) if w]
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
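
The split-and-filter steps above could be wrapped in a small helper. This is only a sketch: the name `split_keep_delims` is made up for illustration, and it assumes the dot is the only extra character that should count as part of a word.

```python
import re

# Hypothetical helper name; not part of the re module.
def split_keep_delims(text):
    # The capturing group makes re.split keep the separators in the result.
    parts = re.split(r'([^\w.]+)', text)
    # Leading or trailing separators produce empty strings; drop them.
    return [p for p in parts if p]

print(split_keep_delims("!! Hello foo.bar.com, your domain ##"))
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
```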

CodePudding user response：

You may use this regex in findall:

\w+(?:\.\w+)*|\W+

This matches either a word followed by zero or more dot-separated words, or one or more non-word characters.

Code:

import re

s = "Hi, domain: (foo.bar.com) bye"
print(re.findall(r'\w+(?:\.\w+)*|\W+', s))

Output:

['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
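
Since the question asks about ignoring "some characters" in general, the same idea might be extended to other joining characters. The sketch below is an assumption rather than part of the answer: the function name `tokenize` and its `joiners` parameter are hypothetical, and `joiners` is assumed to be a non-empty string.

```python
import re

# Hypothetical helper; `joiners` lists the characters allowed inside a word.
def tokenize(text, joiners="."):
    # Build a character class such as [\.] or [\.\-]; joiners must be non-empty.
    inner = "[" + re.escape(joiners) + "]"
    # A word, optionally continued by joiner+word pairs, or a run of non-word characters.
    pattern = r"\w+(?:" + inner + r"\w+)*|\W+"
    return re.findall(pattern, text)

print(tokenize("Hi, domain: (foo.bar.com) bye"))
print(tokenize("a-b c.d", joiners=".-"))
```

A joiner only glues text together when it has word characters on both sides, so a trailing dot as in 'bye. ' still ends up in the non-word token.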

CodePudding user response：

Lookarounds let you easily say "dot, except if it's surrounded by alphabetics on both sides" if that's what you mean:

re.findall(r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)', s)

or simply "word boundary, unless it's a dot":

re.findall(r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)', s)

Notice that the latter will join text across a word boundary if it's a dot; so, for example, 'bye. ' would be extracted as one string.
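
To see that difference side by side, here is a small sketch that runs both lookaround patterns on an input with a dot that is not between word characters. The input string and the variable names are made up for illustration.

```python
import re

s2 = "bye. now"  # made-up input: the dot is followed by a space, not a word

# "dot, except if it's surrounded by word characters on both sides"
dot_between_words = r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)'
# "word boundary, unless it's next to a dot"
boundary_unless_dot = r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)'

first = re.findall(dot_between_words, s2)
second = re.findall(boundary_unless_dot, s2)
print(first)   # ['bye', '. ', 'now']
print(second)  # ['bye. ', 'now']
```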

(Perhaps try to be more precise about your requirements?)

Demo: https://ideone.com/dvQhFO