How to find words with same characters or ANY of the special characters repeated continuously (3 ti

Time:02-16

I am trying to find words and replace them with ' ' in two different cases:

  1. Find and replace words in which the same character is repeated 3 or more times in a row

OR

  2. Find and replace words in which any special characters are repeated 3 or more times in a row.

Looking at the below query:

re.findall(r'([a-zA-Z])\1{3,}', 'I doono if HELLO && AA -AA should be here but hellllooooo or Whyyy should definitely be. So should   , x =-y  --- ')

it returns only the repeated letter, whereas the expected result is ['hellllooooo', 'Whyyy', ' ,', 'x =-y', '---']

and

re.findall(r'[^a-zA-Z0-9 ]{3,}', 'I doono if HELLO should be here but hellllooooo or Whyyy should be here. So should   , x =-y  --- ') 

gives almost accurate results, except that x =-y is returned as =-
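For reference, the first call returns single letters because `re.findall` returns the text of the capturing group whenever the pattern contains exactly one group. A minimal sketch of that behaviour follows; the shortened `sample` string and the `\w*...\w*` wrapper are illustrative assumptions, not part of the question:

```python
import re

sample = 'hellllooooo or Whyyy'

# With a single capturing group, findall returns the group, not the whole match.
# \1{3,} also means "four or more in a row", so 'Whyyy' is not matched at all.
letters = re.findall(r'([a-zA-Z])\1{3,}', sample)
print(letters)
# => ['l', 'o']

# finditer gives access to the whole match; \w* on both sides extends it to
# the full word, and \1{2,} lowers the threshold to three in a row.
words = [m.group(0) for m in re.finditer(r'\w*([a-zA-Z])\1{2,}\w*', sample)]
print(words)
# => ['hellllooooo', 'Whyyy']
```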

So after applying these two conditions, the final result should look like:

I doono if HELLO && AA -AA should be here but or should definitely be. So should

CodePudding user response:

You seem to treat any chunk of one or more non-whitespace characters as a "word". In that case, it is much easier to work with the list obtained with text.split(), drop the items that match the regex and then join the list back with a space:

import re
text = 'I doono if HELLO && AA -AA should be here but hellllooooo or Whyyy should definitely be. So should   , x =-y  --- '
rx = re.compile(r'([a-zA-Z])\1{2,}|[^a-zA-Z0-9\s]{3,}')
print( " ".join(x for x in text.split() if not rx.search(x)) )
# => I doono if HELLO && AA -AA should be here but or should definitely be. So should , x =-y
# (',', 'x' and '=-y' are kept: none of them has three special chars in a row)

The regex is simple:

  • ([a-zA-Z])\1{2,} - an ASCII letter and then two or more occurrences of the same letter
  • | - or
  • [^a-zA-Z0-9\s]{3,} - three or more chars other than ASCII letters, digits and any whitespace chars.

The " ".join(x for x in text.split() if not rx.search(x)) part splits the text on whitespace (text.split()), drops every chunk that matches the regex (if not rx.search(x) keeps only the non-matching chunks) and joins the rest back into a string (with " ".join(...)).
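The split / filter / join steps can also be wrapped in a small helper. This is only a sketch: the names `NOISY` and `drop_noisy_words` and the short test sentence are made up for illustration.

```python
import re

# Same pattern as in the answer: a letter repeated 3+ times in a row,
# or 3+ consecutive characters that are not ASCII letters, digits or whitespace.
NOISY = re.compile(r'([a-zA-Z])\1{2,}|[^a-zA-Z0-9\s]{3,}')

def drop_noisy_words(text, rx=NOISY):
    """Return text without the whitespace-separated chunks that match rx."""
    return " ".join(chunk for chunk in text.split() if not rx.search(chunk))

print(drop_noisy_words('well hellllooooo there !!! ok'))
# => well there ok
```

Note that text.split() followed by " ".join(...) also collapses every run of whitespace into a single space, so the original spacing is not preserved.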

CodePudding user response:

For the example given in the question, replacing all matches of the regular expression

(?=\w*(\w)\1{2})\w*|(?=[^ ]*[ ,=')-]{3})[^ ]*

with a single space (' ') results in the following string:

"I doono if HELLO && AA -AA should be here but   or   should definitely be. So should        "

As "special characters" were not defined in the question I assumed them to be those in the string ,=')-. That could be easily changed, of course.

The resulting string is not quite what was asked for, in that a space was added for every match and no spaces were removed, but it appears to me to be consistent with the stated replacement rules.

The regular expression can be broken down as follows.

(?=            # begin a positive lookahead
  \w*(\w)      # match >= 0 word characters followed by a word character,
               # the latter being saved to capture group 1
  \1{2}        # match the character in capture group 1 twice
)              # end positive lookahead
\w*            # match >= 0 word characters (will be >= 3)
|              # or
(?=            # begin a positive lookahead
  [^ ]*        # match >= 0 characters other than spaces
  [ ,=')-]{3}  # match three special characters
)              # end positive lookahead
[^ ]*          # match >= 0 characters other than spaces (will be >= 3)

(?=\w*(\w)\1{2}) forces the following string of word characters to contain a word character repeated (at least) three times in a row.

(?=[^ ]*[ ,=')-]{3}) forces the following string of non-space characters to contain three special characters in a row.
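The effect of the two lookaheads can be checked on isolated tokens. The sketch below compiles each alternative separately; the names `repeat_rx`, `special_rx` and `whole_token` are hypothetical, and the tokens are tested without any surrounding text:

```python
import re

# The two alternatives from the breakdown above, compiled separately.
repeat_rx = re.compile(r"(?=\w*(\w)\1{2})\w*")
special_rx = re.compile(r"(?=[^ ]*[ ,=')-]{3})[^ ]*")

def whole_token(rx, token):
    """Return the text rx matches at the start of token, or None."""
    m = rx.match(token)
    return m.group(0) if m else None

print(whole_token(repeat_rx, "hellllooooo"))  # hellllooooo
print(whole_token(repeat_rx, "doono"))        # None
print(whole_token(special_rx, "---"))         # ---
print(whole_token(special_rx, "=-y"))         # None
```

Because the lookahead only asserts and consumes nothing, the `\w*` or `[^ ]*` that follows it is what actually captures the whole token.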


If desired, the extra spaces can be removed by modifying the regular expression slightly and replacing matches with empty strings:

(?=\w*(\w)\1{2})\w* *|(?=[^ ]*[ ,=')-]{3,})[^ ]* *
