Given this example:
s = "Hi, domain: (foo.bar.com) bye"
I'd like to create a regex that matches both word and non-word strings, separately, i.e.:
re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo.bar.com", ") ", "bye"]
My approach was to use the word boundary separator \b to catch any string that is bound by two word-to-non-word switches. From the re module docs:
\b is defined as the boundary between a \w and a \W character (or vice versa)
Therefore I tried as a first step:
regex = r'(?:^|\b).*?(?=\b|$)'
re.findall(regex, s)
# Returns: ["Hi", ",", "domain", ": (", "foo", ".", "bar", ".", "com", ") ", "bye"]
The problem is that I don't want the dot (.) character to be a separator too; I'd like the regex to see foo.bar.com as a whole word and not as three words separated by dots.
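To see why the dot splits the match, it can help to list where \b actually matches. A minimal sketch (the sample string foo.bar is only an illustration, not part of the original data):

```python
import re

# Positions where \b matches in a dotted name.
# '.' is a non-word character, so there is a boundary on each side of it.
sample = "foo.bar"
boundaries = [m.start() for m in re.finditer(r'\b', sample)]
print(boundaries)  # [0, 3, 4, 7]
```

Positions 3 and 4 sit on either side of the dot, which is why foo and bar come out as separate matches.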
I tried to find a way to use a negative lookahead on dot but did not manage to make it work.
Is there any way to achieve that?
I don't mind that the dot won't be a separator at all in the regex, it doesn't have to be specific to domain names.
I looked at Regex word boundary alternative, Capture using word boundaries without stopping at "dot" and/or other characters and Regex word boundary excluding the hyphen, but they do not fit my case as I cannot use the space as a separator condition.
Exclude some characters from word boundary is the only one that got me close, but I didn't manage to make it work.
CodePudding user response:
For your example, you could just split on [^\w.]+, using a capturing group around it to keep those values in the output:
import re
s = "Hi, domain: (foo.bar.com) bye"
re.split(r'([^\w.]+)', s)
# ['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
If your string might start or end with non-word/space characters, you can filter out the resultant empty strings in the list with a comprehension:
s = "!! Hello foo.bar.com, your domain ##"
re.split(r'([^\w.]+)', s)
# ['', '!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##', '']
[w for w in re.split(r'([^\w.]+)', s) if len(w)]
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
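The split approach above could be wrapped in a small helper, roughly as follows. This is only a sketch: the names SEPARATOR and split_keep_separators are made up for illustration, and the pattern is the one from this answer.

```python
import re

# Hypothetical helper; the pattern treats anything that is neither a word
# character nor a dot as a separator, and the group keeps the separators.
SEPARATOR = re.compile(r'([^\w.]+)')

def split_keep_separators(text):
    """Split text into word-like and separator pieces, dropping empty strings."""
    return [piece for piece in SEPARATOR.split(text) if piece]

s = "!! Hello foo.bar.com, your domain ##"
tokens = split_keep_separators(s)
print(tokens)
# ['!! ', 'Hello', ' ', 'foo.bar.com', ', ', 'your', ' ', 'domain', ' ##']
```

Because the separators are kept, joining the pieces gives back the original string. Note that a dot is never a separator here, so a trailing dot stays attached to the word before it.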
CodePudding user response:
You may use this regex in findall:
\w+(?:\.\w+)*|\W+
This matches either a word followed by 0 or more dot-separated words, or 1 or more non-word characters.
Code:
import re
s = "Hi, domain: (foo.bar.com) bye"
print(re.findall(r'\w+(?:\.\w+)*|\W+', s))
Output:
['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']
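One difference from splitting on [^\w.]+ is how a dot that is not between two words is handled. The sketch below illustrates this; the second input string is made up, and WORD_OR_GAP is just a name chosen here.

```python
import re

# A word optionally extended by ".word" parts, or a run of non-word characters.
WORD_OR_GAP = re.compile(r'\w+(?:\.\w+)*|\W+')

s = "Hi, domain: (foo.bar.com) bye"
tokens = WORD_OR_GAP.findall(s)
print(tokens)
# ['Hi', ', ', 'domain', ': (', 'foo.bar.com', ') ', 'bye']

# A dot that is not followed by a word character goes to the non-word run.
trailing = WORD_OR_GAP.findall("foo.bar.com. bye")
print(trailing)
# ['foo.bar.com', '. ', 'bye']
```

Every character is either a word or a non-word character, so the matches cover the whole input and joining them reproduces it.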
CodePudding user response:
Lookarounds let you easily say "dot, except if it's surrounded by word characters on both sides" if that's what you mean:
re.findall(r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)', s)
or simply "word boundary, unless it's a dot":
re.findall(r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)', s)
Notice that the latter will join text across a word boundary if it's a dot; so, for example, 'bye. ' would be extracted as one string.
(Perhaps try to be more precise about your requirements?)
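To make the difference between the two patterns concrete, here is a small comparison. It is a sketch only: the variable names and the second test string "Hi. bye" are chosen here for illustration.

```python
import re

# Dot only counts as part of a word when it sits between word characters.
dot_inside_word = r'(?:^|\b)(\w+(?:\.\w+)*|\W+)(?!\.\w)(?=\b|$)'
# Any word boundary that touches a dot is ignored.
boundary_not_at_dot = r'(?:^|(?<!\.)\b(?!\.)).+?(?=(?<!\.)\b(?!\.)|$)'

s = "Hi, domain: (foo.bar.com) bye"
first = re.findall(dot_inside_word, s)
second = re.findall(boundary_not_at_dot, s)
print(first == second)  # True: both give the desired split for this input

sample = "Hi. bye"
first_sample = re.findall(dot_inside_word, sample)
second_sample = re.findall(boundary_not_at_dot, sample)
print(first_sample)   # ['Hi', '. ', 'bye']
print(second_sample)  # ['Hi. ', 'bye']
```

On the example from the question both patterns agree; they only diverge when a dot appears next to a word without another word on its other side.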