Using RegEx to match all substrings except those that start with a specific char

I have the following string:

string = '"General Slocum" 15 June 1904 East River _ human factor %_ %& 4'

Q: Using RegEx, extract the entire string except the substrings that start with S.

Code for finding the substrings that start with S:

print(re.findall(r'S[\w]+', string))

output: ['Slocum']

The best solution I came up with, using the sub method:

print(re.sub(r'S\w+', '', string))

output: "General " 15 June 1904 East River. _ human factor %_ %& 4

=====================================================================

Problem: I am not able to write a regex that recognizes all substrings that start with the specific char S and then returns everything except those substrings.

Example :

print(re.findall(r'[^ S\w ][\w\%\d\&\.]+', string))

output: ['"General', '%__', '%&']

CodePudding user response：

With re.findall, you can put what you want to keep in a capture group and match, without capturing, what you don't want to keep.

Depending on how specific you want the pattern to be, you might use:

(?<!\S)S\S*|(\S+)

(?<!\S) Assert a whitespace boundary to the left
S\S* Match an S char and optional non-whitespace chars
| Or
(\S+) Capture group 1, match 1 or more non-whitespace chars

Or for only word characters:

(?<!\S)S\w+|(\S+)

Here S\w+ matches an S char followed by 1 or more word chars, so it matches at least 2 chars.
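
As a rough illustration of how the word-character variant differs on this answer's sample string: `\w+` stops at the closing quote after Slocum, so that quote is left over and gets captured on its own by group 1. The name `word_tokens` is only used here for illustration.

```python
import re

# Sample string from this answer
string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'

# S\w+ consumes "Slocum" but not the quote that follows it,
# so the leftover quote is captured by (\S+) as a separate token
word_pattern = r"(?<!\S)S\w+|(\S+)"
word_tokens = [s for s in re.findall(word_pattern, string) if s]
print(word_tokens)
# ['"General', '"', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
```

If the stray quote is not wanted, the S\S* version is probably the better fit.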

For example

import re

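pattern = r"(?<!\S)S\S*|(\S+)"
string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'
# findall returns group 1 for every match; matches of the S-alternative yield '' and are filtered out
print([s for s in re.findall(pattern, string) if s])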

Output

['"General', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
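
To see why the `if s` filter is needed, here is a small sketch that prints the unfiltered result. When a pattern has exactly one capture group, re.findall returns that group for every match, so a match made by the S-alternative shows up as an empty string. The names `raw` and `joined` are just illustrative, and joining with a single space is an assumption about the desired output format.

```python
import re

pattern = r"(?<!\S)S\S*|(\S+)"
string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'

# Unfiltered result: the token starting with S appears as ''
raw = re.findall(pattern, string)
print(raw[:3])
# ['"General', '', '15']

# Drop the empty entries and rebuild a single string (assumes single-space separators)
joined = " ".join(s for s in raw if s)
print(joined)
# "General 15 June 1904 East River _ human factor _%__ %& 4
```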

CodePudding user response：

Here is a non-regex approach that splits the input on whitespace and keeps each element that does not start with the letter s (checked case-insensitively):

string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'

print([s for s in string.split() if not s.lower().startswith('s')])

Output:

['"General', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
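
Note that because of `.lower()`, this check also drops tokens starting with a lowercase s, whereas the question only mentions an uppercase S. A minimal sketch of making that choice explicit follows; `drop_tokens_starting_with` is a hypothetical helper and the `sample` string is made up for illustration.

```python
def drop_tokens_starting_with(text, prefix, ignore_case=False):
    """Split text on whitespace and drop tokens that begin with prefix (hypothetical helper)."""
    if ignore_case:
        prefix = prefix.lower()
        return [t for t in text.split() if not t.lower().startswith(prefix)]
    return [t for t in text.split() if not t.startswith(prefix)]

# Made-up sample with both an uppercase and a lowercase s-token
sample = 'Slocum sank in 1904'

print(drop_tokens_starting_with(sample, 'S'))
# ['sank', 'in', '1904']
print(drop_tokens_starting_with(sample, 'S', ignore_case=True))
# ['in', '1904']
```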