Find the first/last n words of a string with a maximum of 20 characters using regex

Time:10-23

I'm trying to find any number of words at the beginning or end of a string with a maximum of 20 characters.

This is what I have right now:

s1 = "Hello,    World! This is a reallly long string"
match = re.search(r"^(\b.{0,20}\b)", s1)
print(f"'{match.group(0)}'") # 'Hello,    World! '

My problem is the extra space that it adds at the end. I believe this is because \b matches at both the beginning and the end of a word, but I'm not sure what to do about it.
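
To see why the match ends on a space, it can help to list where \b actually matches. The sketch below is illustrative only (the name `boundaries` is not from the original post) and assumes the four-space string shown above. The greedy .{0,20} backs off to the last boundary at or before offset 20, which here is offset 17, right after the space that precedes "This".

```python
import re

s1 = "Hello,    World! This is a reallly long string"

# \b is zero-width: collect the offsets where it matches, keeping only
# those reachable by a prefix of at most 20 characters.
boundaries = [m.start() for m in re.finditer(r"\b", s1) if m.start() <= 20]
print(boundaries)                     # [0, 5, 10, 15, 17]

# The prefix ending at the last reachable boundary keeps the trailing space.
print(repr(s1[:boundaries[-1]]))      # 'Hello,    World! '
```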

I run into the same issue if I try to do the same with the end of the string but with a leading space instead:

s1 = "Hello,    World! This is a reallly long string"
match = re.search(r"(\b.{0,20}\b)$", s1)
print(f"'{match.group(0)}'") # ' reallly long string'

I know I can just use rstrip and lstrip to get rid of the leading/trailing whitespace but I was just wondering if there's a way to do it with regex.

CodePudding user response:

You can use r"^(.{0,19}\S\b|)". The \S ensures that the match ends on a non-whitespace character. Because \S itself consumes one character, the preceding quantifier has to be reduced to 19, and the empty alternative after | lets the pattern match zero characters when nothing else fits:

import re
s1 = "Hello,    World! This is a reallly long string"
match = re.search(r"^(.{0,19}\S\b|)", s1)
print(f"'{match.group(0)}'", len(match.group(0)))

Output:

'Hello,    World' 15

For the end of the string, use r"(|\b\S.{0,19})$":

import re
s1 = "Hello,    World! This is a reallly long string"
match = re.search(r"(|\b\S.{0,19})$", s1)
print(f"'{match.group(0)}'", len(match.group(0)))

Output:

'reallly long string' 19
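
If the limit needs to vary, both patterns can be built from a parameter. This is only a sketch: the helper names `first_words` and `last_words` and the `limit` argument are hypothetical and not part of the answer above. The quantifier is `limit - 1` for the same reason 20 became 19.

```python
import re

def first_words(text: str, limit: int = 20) -> str:
    """Return the longest prefix of at most `limit` characters that ends
    on a non-whitespace character at a word boundary ('' if none fits)."""
    if limit < 1:
        return ""
    # Doubled braces are literal braces in an f-string: .{0,19} for limit=20.
    pattern = rf"^(.{{0,{limit - 1}}}\S\b|)"
    return re.search(pattern, text).group(0)

def last_words(text: str, limit: int = 20) -> str:
    """Return the longest suffix of at most `limit` characters that starts
    on a non-whitespace character at a word boundary ('' if none fits)."""
    if limit < 1:
        return ""
    pattern = rf"(|\b\S.{{0,{limit - 1}}})$"
    return re.search(pattern, text).group(0)

s1 = "Hello,    World! This is a reallly long string"
print(repr(first_words(s1)))       # 'Hello,    World'
print(repr(last_words(s1)))        # 'reallly long string'
print(repr(last_words(s1, 11)))    # 'long string'
```

Because each pattern contains an empty alternative, re.search always finds a match here, so calling .group(0) directly is safe.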
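
Why (...|)?

The empty alternative allows a match of zero characters. Without it, i.e. with ^(.{0,19}\S\b), the example below would find no match at all: re.search would return None and match.group(0) would raise an AttributeError.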

import re
s1 = "X"*21
match = re.search(r"^(.{0,19}\S\b|)", s1)
print(f"'{match.group(0)}'", len(match.group(0)))

Output:

'' 0
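
For comparison, here is a small sketch that runs both variants side by side on the same input (the variable names `without_empty` and `with_empty` are illustrative). A single 21-character word has no word boundary within its first 20 characters, so only the pattern with the empty alternative matches.

```python
import re

s1 = "X" * 21  # one 21-character word, longer than the 20-character limit

without_empty = re.search(r"^(.{0,19}\S\b)", s1)   # no empty alternative
with_empty = re.search(r"^(.{0,19}\S\b|)", s1)     # empty alternative added

print(without_empty)                # None
print(repr(with_empty.group(0)))    # ''
```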

CodePudding user response:

You may use this regex:

^\S.{0,18}\S\b|\b\S.{0,18}\S$

\S (not a whitespace) at the start and end guarantees that your matches start and end with a non-whitespace character.

code:

import re

s = "Hello,    World! This is a reallly long string"

print(re.findall(r'^\S.{0,18}\S\b|\b\S.{0,18}\S$', s))
# ['Hello,    World', 'reallly long string']
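
A few edge cases are worth checking before relying on this pattern. The sketch below is illustrative (the `PATTERN` constant and `head_and_tail` helper are hypothetical names). Since each alternative needs a \S at both ends, it matches at least two characters, and it has no empty fallback, so some inputs yield no matches at all.

```python
import re

PATTERN = re.compile(r"^\S.{0,18}\S\b|\b\S.{0,18}\S$")

def head_and_tail(s: str) -> list[str]:
    """Return every non-overlapping match of PATTERN in s."""
    return PATTERN.findall(s)

# A string that already fits within 20 characters is matched once, as a whole.
print(head_and_tail("short text"))   # ['short text']

# A single word longer than 20 characters produces no match.
print(head_and_tail("X" * 21))       # []

# A one-character string is too short for either alternative.
print(head_and_tail("a"))            # []
```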