I have the following string:
string = '"General Slocum" 15 June 1904 East River _ human factor %_ %& 4'
Q: Using RegEx, extract the entire string except the substrings that start with S.
Code for finding substring that starts with S:
print(re.findall(r'S[\w]+', string))
output: ['Slocum']
The best solution I came up with, using the sub method:
print(re.sub(r'S\w+', '', string))
output: "General " 15 June 1904 East River. _ human factor %_ %& 4
=====================================================================
Problem: I am not able to write a regex that recognizes all substrings that start with the specific char S and then returns everything except those substrings.
Example :
print(re.findall(r'[^ S\w+][\w\%\d\&\.]+', string))
output: ['"General', '%__', '%&']
CodePudding user response:
Using re.findall you can use a capture group for what you want to keep, and match what you don't want to keep.
Depending on how specific you want the pattern to be, you might use:
(?<!\S)S\S*|(\S+)
(?<!\S)  Assert a whitespace boundary to the left
S\S*  Match an S char and optional non-whitespace chars
|  Or
(\S+)  Capture group 1, match 1 or more non-whitespace chars
Or for only word characters:
(?<!\S)S\w+|(\S+)
Here the S\w+ matches an S char and 1 or more word chars, matching at least 2 chars.
For example
import re
pattern = r"(?<!\S)S\S*|(\S+)"
string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'
# findall returns group 1 per match; it is '' where the S-branch matched, so filter those out
print([s for s in re.findall(pattern, string) if s])
Output
['"General', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
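The two patterns above differ in what they leave behind when the S-token carries trailing punctuation. The following is a small sketch comparing them on the same input; the names whole_token, word_only and kept_tokens are made up for this illustration:

```python
import re

string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'

# S\S* consumes the whole whitespace-delimited token, trailing quote included.
whole_token = r"(?<!\S)S\S*|(\S+)"
# S\w+ stops at the first non-word character, so the quote is left behind.
word_only = r"(?<!\S)S\w+|(\S+)"

def kept_tokens(pattern, text):
    # Hypothetical helper: findall yields group 1 for every match,
    # which is '' whenever the S-branch matched, so drop the empty strings.
    return [s for s in re.findall(pattern, text) if s]

print(kept_tokens(whole_token, string))
# ['"General', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
print(kept_tokens(word_only, string))
# ['"General', '"', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
```

With the word-character variant, the closing quote after Slocum is no longer part of the skipped match, so it shows up as its own item in the result.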
CodePudding user response:
Here is a non-regex approach that splits the input on whitespace and keeps each element that doesn't start with the letter s (checked case-insensitively):
string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'
print([s for s in string.split() if not s.lower().startswith('s')])
Output:
['"General', '15', 'June', '1904', 'East', 'River', '_', 'human', 'factor', '_%__', '%&', '4']
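If a single string is wanted back rather than a list, and only an uppercase S should count, the split approach can be adapted. This is a minimal sketch: the function name drop_tokens_starting_with and its parameters are invented for illustration, and joining with single spaces will collapse any repeated whitespace from the input.

```python
def drop_tokens_starting_with(text, prefix="S", ignore_case=False):
    # Hypothetical helper, not part of any library.
    if ignore_case:
        prefix = prefix.lower()
    kept = []
    for token in text.split():
        # Compare against a lowercased copy only when case should be ignored.
        probe = token.lower() if ignore_case else token
        if not probe.startswith(prefix):
            kept.append(token)
    # Rebuild one string from the surviving tokens.
    return " ".join(kept)

string = '"General Slocum" 15 June 1904 East River _ human factor _%__ %& 4'
result = drop_tokens_starting_with(string)
print(result)
# "General 15 June 1904 East River _ human factor _%__ %& 4
```

Passing ignore_case=True gives the same filtering as the list comprehension above, which also drops tokens starting with a lowercase s.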