I'm trying to find any number of words at the beginning or end of a string with a maximum of 20 characters.
This is what I have right now:
import re
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"^(\b.{0,20}\b)", s1)
print(f"'{match.group(0)}'") # 'Hello, World! This '
My problem is the extra space at the end of the match. I believe this is because \b matches at either the beginning or the end of a word, but I'm not sure what to do about it.
I run into the same issue if I try to do the same with the end of the string but with a leading space instead:
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"(\b.{0,20}\b)$", s1)
print(f"'{match.group(0)}'") # ' reallly long string'
I know I can just use rstrip and lstrip to get rid of the leading/trailing whitespace but I was just wondering if there's a way to do it with regex.
CodePudding user response:
You can use r"^(.{0,19}\S\b|)". The \S ensures that the last matched character, the one at the word boundary, is not whitespace. Because \S consumes one character itself, the preceding quantifier has to be reduced to 19, and the empty alternative after | lets the pattern match zero characters if needed:
import re
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"^(.{0,19}\S\b|)", s1)
print(f"'{match.group(0)}'", len(match.group(0)))
Output:
'Hello, World! This' 18
For the end of the string, use r"(|\b\S.{0,19})$":
import re
s1 = "Hello, World! This is a reallly long string"
match = re.search(r"(|\b\S.{0,19})$", s1)
print(f"'{match.group(0)}'", len(match.group(0)))
output:
'reallly long string' 19
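If the limit needs to vary, the two patterns above can be built from a parameter. This is only a minimal sketch: the helper names head_words and tail_words and the max_len parameter are hypothetical, not part of the original answer.

```python
import re

def head_words(text: str, max_len: int = 20) -> str:
    """Prefix of at most max_len chars ending on a non-space char at a word boundary."""
    if max_len < 1:
        return ""
    # Same shape as ^(.{0,19}\S\b|), with the quantifier derived from max_len.
    pattern = rf"^(.{{0,{max_len - 1}}}\S\b|)"
    return re.search(pattern, text).group(0)

def tail_words(text: str, max_len: int = 20) -> str:
    """Suffix of at most max_len chars starting on a non-space char at a word boundary."""
    if max_len < 1:
        return ""
    # Same shape as (|\b\S.{0,19})$, with the quantifier derived from max_len.
    pattern = rf"(|\b\S.{{0,{max_len - 1}}})$"
    return re.search(pattern, text).group(0)

s1 = "Hello, World! This is a reallly long string"
print(repr(head_words(s1)))      # 'Hello, World! This'
print(repr(tail_words(s1)))      # 'reallly long string'
print(repr(head_words(s1, 12)))  # 'Hello, World'
print(repr(tail_words(s1, 11)))  # 'long string'
```

The empty alternative in both patterns means re.search always finds a match, so calling .group(0) directly is safe here.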
Why (...|)?
To allow a match of zero characters. With ^(.{0,19}\S\b) alone, re.search would return None for the string below, and match.group(0) would raise an AttributeError:
import re
s1 = "X"*21
match = re.search(r"^(.{0,19}\S\b|)", s1)
print(f"'{match.group(0)}'", len(match.group(0)))
output:
'' 0
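To make the difference concrete, here is a small side-by-side check of the pattern with and without the empty alternative on the same 21-character input. The variable names are only illustrative.

```python
import re

s1 = "X" * 21  # a single 21-character word, so no word boundary within the first 20 characters

without_empty = re.search(r"^(.{0,19}\S\b)", s1)
with_empty = re.search(r"^(.{0,19}\S\b|)", s1)

print(without_empty)              # None
print(repr(with_empty.group(0)))  # ''
```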
CodePudding user response:
You may use this regex:
^\S.{0,18}\S\b|\b\S.{0,18}\S$
\S (a non-whitespace character) at the start and end guarantees that each match starts and ends with a non-whitespace character.
code:
import re
s = "Hello, World! This is a reallly long string"
print(re.findall(r'^\S.{0,18}\S\b|\b\S.{0,18}\S$', s))
# ['Hello, World! This', 'reallly long string']
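One thing to keep in mind with this pattern: each alternative contains two \S tokens, so a match needs at least two characters, and a string short enough to fit within the limit is returned once rather than twice. A quick check, as a sketch (the name pattern is just for illustration):

```python
import re

pattern = r'^\S.{0,18}\S\b|\b\S.{0,18}\S$'

one_char = re.findall(pattern, "a")           # too short for the two \S tokens
two_chars = re.findall(pattern, "ab")
short_text = re.findall(pattern, "Hello World")  # the whole string fits in one match

print(one_char)    # []
print(two_chars)   # ['ab']
print(short_text)  # ['Hello World']
```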