I am trying to find words and replace them with ' ' in two different cases:
- Find and replace words that have the same character repeated more than 3 times in a row
OR
- Find and replace words that have any special characters repeated 3 or more times in a row.
Looking at the below query:
re.findall(r'([a-zA-Z])\1{3,}', 'I doono if HELLO && AA -AA should be here but hellllooooo or Whyyy should definitely be. So should , x =-y --- ')
it returns only the repeated letters, whereas the result should be ['hellllooooo', 'Whyyy', ' ,' , 'x =-y', '---']
and
re.findall(r'[^a-zA-Z0-9 ]{3,}', 'I doono if HELLO should be here but hellllooooo or Whyyy should be here. So should , x =-y --- ')
gives almost accurate results, except that x =-y is given as =-
So after applying these two conditions, the final result looks like:
I doono if HELLO && AA -AA should be here but or should definitely be. So should
CodePudding user response:
You seem to consider any chunk of one or more non-whitespace characters a "word". In that case, it is much easier to work with the list obtained with text.split(), drop the items that match the regex, and then join the list back with a space:
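import re
text = 'I doono if HELLO && AA -AA should be here but hellllooooo or Whyyy should definitely be. So should , x =-y --- '
# Branch 1: the same ASCII letter 3+ times in a row; branch 2: 3+ consecutive special chars
rx = re.compile(r'([a-zA-Z])\1{2,}|[^a-zA-Z0-9\s]{3,}')
# Keep only the whitespace-separated chunks that the regex does not match
print( " ".join(x for x in text.split() if not rx.search(x)) )
# => I doono if HELLO && AA -AA should be here but or should definitely be. So should , x =-y
# (",", "x" and "=-y" are kept: none of them contains three consecutive special characters)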
See the Python demo. The regex is simple:
- ([a-zA-Z])\1{2,} - an ASCII letter followed by two or more occurrences of the same letter
- | - or
- [^a-zA-Z0-9\s]{3,} - three or more consecutive chars other than ASCII letters, digits and whitespace chars.
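To see which branch fires for a given chunk, the two alternatives can be tried separately. This is only an illustrative sketch: the names letters_rx and specials_rx and the sample tokens are chosen here and do not come from the answer. It also shows why the re.findall call in the question returned single letters: when a pattern contains exactly one capture group, re.findall returns that group rather than the whole match.

```python
import re

# The two branches of the combined regex, split apart (names are illustrative).
letters_rx = re.compile(r'([a-zA-Z])\1{2,}')      # same ASCII letter 3+ times in a row
specials_rx = re.compile(r'[^a-zA-Z0-9\s]{3,}')   # 3+ consecutive non-alphanumeric, non-space chars

tokens = ['HELLO', 'hellllooooo', 'Whyyy', '&&', '=-y', '---']
hits = {t: (bool(letters_rx.search(t)), bool(specials_rx.search(t))) for t in tokens}
for token, (by_letter, by_special) in hits.items():
    print(f'{token!r}: letter branch={by_letter}, special branch={by_special}')

# With a single capture group, findall returns the group (the repeated letter),
# not the whole word.
repeated = letters_rx.findall('hellllooooo or Whyyy')
print(repeated)
```

Because only a yes/no answer per chunk is needed, rx.search(x) sidesteps the capture-group behaviour of findall entirely.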
The " ".join(x for x in text.split() if not rx.search(x)) part splits the text on whitespace (text.split()), keeps only the chunks that do not match the regex (if not rx.search(x)) and joins them back into a string (with " ".join(...)).
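The same split, filter and join steps can be wrapped in a small helper that also reports what was dropped, which helps when tuning the regex. This is a minimal sketch: split_noisy and NOISY_RX are hypothetical names introduced here, and the shortened sample string is only for illustration.

```python
import re

# Same pattern as in the answer; the constant name is illustrative.
NOISY_RX = re.compile(r'([a-zA-Z])\1{2,}|[^a-zA-Z0-9\s]{3,}')

def split_noisy(text, rx=NOISY_RX):
    """Return (kept, dropped) lists of whitespace-separated chunks."""
    kept, dropped = [], []
    for chunk in text.split():
        # A chunk is dropped as soon as the regex matches anywhere inside it.
        (dropped if rx.search(chunk) else kept).append(chunk)
    return kept, dropped

kept, dropped = split_noisy('but hellllooooo or Whyyy should --- be')
print(' '.join(kept))
print(dropped)
```

Note that text.split() collapses runs of whitespace, so the rejoined string always has single spaces between the chunks that remain.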
CodePudding user response:
For the example given in the question, replacing all matches of the regular expression
(?=\w*(\w)\1{2})\w*|(?=[^ ]*[ ,=')-]{3})[^ ]*
with a single space (' ') results in the following string:
"I doono if HELLO && AA -AA should be here but or should definitely be. So should "
As "special characters" were not defined in the question, I assumed them to be those in the string " ,=')-" (note the leading space). That could easily be changed, of course.
The resulting string is not quite what was asked for, in that a space was added for every match and no spaces were removed, but it appears to me to be consistent with the stated replacement rules.
The regular expression can be broken down as follows.
(?= # begin a positive lookahead
\w*(\w) # match >= 0 word characters followed by a word character,
# the latter being saved to capture group 1
\1{2} # match the character in capture group 1 twice
) # end positive lookahead
\w* # match >= 0 word characters (will be >= 3)
| # or
(?= # begin a positive lookahead
[^ ]* # match >= 0 characters other than spaces
[ ,=')-]{3} # match three special characters
) # end positive lookahead
[^ ]* # match >= 0 characters other than spaces (will be >= 3)
(?=\w*(\w)\1{2}) forces the following string of word characters to contain a word character repeated (at least) three times in a row.
(?=[^ ]*[ ,=')-]{3}) forces the following string of non-space characters to contain three special characters in a row.
If desired, the extra spaces can be removed by modifying the regular expression slightly and replacing matches with empty strings:
(?=\w*(\w)\1{2})\w* *|(?=[^ ]*[ ,=')-]{3,})[^ ]* *
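For anyone who wants to try the lookahead approach, the first expression can be applied with re.sub as sketched below. The sketch assumes the second branch ends in [^ ]*, as the last line of the breakdown shows, and uses the special-character set from this answer. The variable names are illustrative, and no particular spacing of the output is claimed here; running it is the easiest way to inspect what is replaced.

```python
import re

text = ('I doono if HELLO && AA -AA should be here but hellllooooo or Whyyy '
        'should definitely be. So should , x =-y --- ')

# Assumption: the second branch ends in [^ ]*, matching the breakdown above.
# A double-quoted raw string is used because the class contains an apostrophe.
pattern = re.compile(r"(?=\w*(\w)\1{2})\w*|(?=[^ ]*[ ,=')-]{3})[^ ]*")

# Replace every match with a single space, as described in the answer.
result = pattern.sub(' ', text)
print(repr(result))
```

Since the space is part of the special-character set, the second lookahead can also succeed on a word that is merely followed by spaces and specials, so it is worth checking the printed result against the words you expect to keep.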