regex remove special character except first occurrence and last occurrence

Time:06-07

I need to remove the special characters inside angle brackets (<>).

date = "<dd_mm_yyyy>"
date_check_pattern= re.sub("[^A-Za-z<>]","",date)

But it does not work for

date = "<dd>_<mm>_<yyyy>"
# expected output: <ddmmyyyy>

How can I remove the angle brackets, except for the first and last occurrence?

CodePudding user response:

You could match everything from the first < to the last > and capture what is in between in group 1.

Then do a second replacement that removes all disallowed characters from the group 1 value of the first pattern.

The first pattern matches

  • <(.*?)> Match from <...> and capture what is in between in group 1
  • (?= Positive lookahead, assert what is to the right is
    • [^\s<>]*$ Optionally match any char except whitespace, < or > till the end of the string
  • ) Close the lookahead
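
To see what the first pattern captures before any cleanup, here is a minimal sketch. The sample string is one of the inputs used in this answer; the variable names are only illustrative.

```python
import re

# Lazy capture between "<" and ">"; the lookahead only succeeds at a ">"
# that is followed by no further "<", ">" or whitespace up to the end.
pattern = re.compile(r'<(.*?)>(?=[^\s<>]*$)')

m = pattern.search("file_name<dd>_<mm>_<yyyy>.csv")
print(m.group(0))  # <dd>_<mm>_<yyyy>
print(m.group(1))  # dd>_<mm>_<yyyy
```

Group 1 still contains the inner brackets and underscores, which is why a second substitution is applied to it.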


The negated character class [^A-Za-z0-9] in the second sub matches a single character that is not listed in the class, that is, anything other than an ASCII letter or digit.

Example code

import re

strings = [
    "<dd>_<mm>_<yyyy>",
    "<dd_mm_yyyy>",
    "<01>_<01>_<2022>",
    "file_name<dd>_<mm>_<yyyy>.csv",
    "file_name<01>_<01>_<2022>.csv",
    "file_name_<dd>_<mm>_<yyyy>_anything.csv"
]

for s in strings:
    print(
        re.sub(
            r'<(.*?)>(?=[^\s<>]*$)',
            lambda x: f"<{re.sub(r'[^A-Za-z0-9]', '', x.group(1))}>",
            s
        )
    )

Output

<ddmmyyyy>
<ddmmyyyy>
<01012022>
file_name<ddmmyyyy>.csv
file_name<01012022>.csv
file_name_<ddmmyyyy>_anything.csv

CodePudding user response:

This might not be exactly what you want, but if the brackets always come at the start and the end of the string, you can use a negative lookbehind and a negative lookahead so that the first and last characters are never matched.

import re

date = "<dd_mm_yyyy>"
date_check_pattern = re.sub("(?<!^)[^A-Za-z<>](?!$)","",date)
print(date_check_pattern)

output

<ddmmyyyy>
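
Under the same assumption (the string starts with < and ends with >), the idea can probably be stretched to the multi-bracket input from the question by no longer exempting < and > in the character class. This is only a sketch, and the helper name strip_inner is made up for illustration. It does not cover inputs with text before or after the brackets.

```python
import re

def strip_inner(text):
    # Remove every character that is not an ASCII letter or digit,
    # except the one at the very start and the one at the very end.
    return re.sub(r"(?<!^)[^A-Za-z0-9](?!$)", "", text)

print(strip_inner("<dd>_<mm>_<yyyy>"))  # <ddmmyyyy>
print(strip_inner("<01>_<01>_<2022>"))  # <01012022>
```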

CodePudding user response:

I suggest not using a regex for this. You can use the following function and customize specialChar as needed.

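def removeSpecialChar(date):
    result = ""
    inside = False        # True between the first "<" and the last ">"
    specialChar = "_<>"   # characters to drop while inside the brackets

    # Index of the last ">"; without one there is nothing to clean.
    lastClose = date.rfind(">")
    if lastClose == -1:
        return date

    for index, c in enumerate(date):
        if not inside and c == "<":
            # Opening bracket: keep it and start filtering.
            inside = True
            result += c
        elif inside and c == ">" and index == lastClose:
            # Final closing bracket: stop filtering so it is kept below.
            inside = False

        if inside and c not in specialChar:
            result += c
        elif not inside:
            result += c

    return result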

example

>>> removeSpecialChar("<dd>_<mm>_<yyyy>")
'<ddmmyyyy>'

>>> removeSpecialChar("Date: (<dd>_<mm>_<yyyy>)")
'Date: (<ddmmyyyy>)'
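
For comparison, the same non-regex idea can be sketched more compactly with str.find and str.rfind. The function name strip_between_brackets and its special parameter are hypothetical, and the sketch assumes the span to clean runs from the first < to the last >.

```python
def strip_between_brackets(text, special="_<>"):
    """Drop `special` characters between the first "<" and the last ">"."""
    start = text.find("<")
    end = text.rfind(">")
    if start == -1 or end <= start:
        return text  # no bracketed span to clean
    # Filter only the part strictly between the outer brackets.
    inner = "".join(ch for ch in text[start + 1:end] if ch not in special)
    return text[:start] + "<" + inner + ">" + text[end + 1:]

print(strip_between_brackets("<dd>_<mm>_<yyyy>"))          # <ddmmyyyy>
print(strip_between_brackets("Date: (<dd>_<mm>_<yyyy>)"))  # Date: (<ddmmyyyy>)
```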