Lowercase based on regex pattern in Django & Python

Time:09-17

The script I am using calls an s_lower method to convert all text to lowercase, with one catch: anything that matches a special URL regex is left untouched. I would like to apply the same (or similar) logic with another regex.

import re

RE_WEBURL_NC = (
    r"(?:(?:(?:(?:https?):)\/\/)(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1["
    r"6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?"
    r":[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9][a-z0-9_-]{0,62})?[a-z0-9]\.) (?:[a-z]{2,}\.?))(?::\d{2,5})?)(?:"
    r"(?:[/?#](?:(?![\s\"<>{}|\\^~\[\]`])(?!&lt;|&gt;|&quot;|&#x27;).)*))?"
)

def s_lower(value):
    url_nc = re.compile(f"({RE_WEBURL_NC})")

    # Do not lowercase links
    if url_nc.search(value):
        substrings = url_nc.split(value)
        for idx, substr in enumerate(substrings):
            if not url_nc.match(substr):
                substrings[idx] = i18n_lower(substr)
        return "".join(substrings)

    return i18n_lower(value)

I want to lowercase all text other than text inside the special tags.

def s_lower(value):
    spec_nc = re.compile(r"\[spec .*\]") # this is for [spec some raNdoM cAsE text here]

    if spec_nc.search(value):
        substrings = spec_nc.split(value)
        for idx, substr in enumerate(substrings):
            if not spec_nc.match(substr):
                substrings[idx] = i18n_lower(substr)
        return "".join(substrings)

    return i18n_lower(value)

Based on @Nick's answer, I made some changes and everything works, but now I have a more complicated case. The script uses another special tag, [spec-complex text with tabs, spaces, new lines etc. here], whose content must keep all of its spaces, indentation, and so on:

SPEC_COMPLEX_EXPR = r"(?:spec-complex)"
SPEC_COMPLEX_REGEX = fr"\[{SPEC_COMPLEX_EXPR}\s?(?:\s*\n)?([^\]]*)\]\s?\n?" # do not change this

I need to use SPEC_COMPLEX_REGEX in s_lower, so basically I want to add it to spec_nc:

def s_lower(value):
    # original spec_nc:
    # spec_nc = re.compile(r"((?:\[spec [^]]*\])|(?:RE_WEBURL_NC))")

    # what I want is something like:
    spec_nc = re.compile(r"((?:\[spec [^]]*\])|(?:RE_WEBURL_NC)|(?:\[spec-complex\s?(?:\s*\n)?([^\]]*)\]\s?\n?))")

    substrings = spec_nc.split(value)
    for idx, substr in enumerate(substrings):
        if idx % 2 == 0:
            substrings[idx] = i18n_lower(substr)
    return "".join(substrings)

This also leaves the capital letters in [spec-complex] untouched, but it no longer preserves the spaces, tabs, indentation, etc.

CodePudding user response:

Was writing this as a comment, but it got too long...

You haven't actually said what your problem is, but it looks like you're missing the () around the regex; the capturing group is what makes split keep the matched text in substrings. It should be:

spec_nc = re.compile(r"(\[spec .*\])")

Note:

  • you should use [^]]* instead of .* so that the match stays within a single pair of []; .* is greedy and would run to the last ] on the line.
  • you don't really need the search; if the pattern is not present, split simply returns the original string in a single-element list, which you can still iterate over.
  • you don't need the call to match either; with a single capturing group, the strings that matched the split regex are always at the odd indexes of the list, so you can just lowercase depending on idx.
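To make those points concrete, here is a small sketch of how re.split behaves with and without the capturing group. The sample strings are made up for illustration only.

```python
import re

# One capturing group around the whole pattern, so split() keeps the matches.
spec_nc = re.compile(r"(\[spec [^]]*\])")

# Matched tags land at the odd indexes, the surrounding text at the even ones.
parts = spec_nc.split("Hello [spec KeEp Me] World [spec A] End")
print(parts)

# No match at all: split() returns the whole string as a single element.
no_match = spec_nc.split("No Tags Here")
print(no_match)

# Without the capturing group, the matched tags are dropped from the result.
no_group = re.split(r"\[spec [^]]*\]", "Hello [spec KeEp Me] World")
print(no_group)
```

The last call is the reason the original version lost the tags: nothing was left in the list for match to find.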

So you can simplify your code to:

def s_lower(value):
    spec_nc = re.compile(r"(\[spec [^]]*\])")  # this is for [spec some raNdoM cAsE text here]

    substrings = spec_nc.split(value)
    for idx, substr in enumerate(substrings):
        if idx % 2 == 0:
            substrings[idx] = i18n_lower(substr)
    return "".join(substrings)
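
For the [spec-complex] follow-up in the question, two things probably get in the way. SPEC_COMPLEX_REGEX contains its own capturing group, and re.split returns one list element per group for every match, so the "matches are at the odd indexes" rule no longer holds. Also, RE_WEBURL_NC is written inside the string literal, so it is matched as plain text rather than interpolated. One possible way around both, without touching SPEC_COMPLEX_REGEX, is to walk the matches with finditer and lowercase only the gaps between them. This is only a sketch: URL_REGEX is a simplified stand-in for RE_WEBURL_NC, and i18n_lower is replaced by a plain str.lower so the example runs on its own.

```python
import re

SPEC_COMPLEX_EXPR = r"(?:spec-complex)"
SPEC_COMPLEX_REGEX = fr"\[{SPEC_COMPLEX_EXPR}\s?(?:\s*\n)?([^\]]*)\]\s?\n?"  # unchanged

# Hypothetical stand-in; the real RE_WEBURL_NC from the question would go here.
URL_REGEX = r"https?://\S+"

# Interpolate the sub-patterns with an f-string instead of writing their names.
PROTECTED = re.compile(fr"\[spec [^]]*\]|{SPEC_COMPLEX_REGEX}|{URL_REGEX}")


def i18n_lower(text):
    # Stand-in for the project's i18n_lower helper (assumption).
    return text.lower()


def s_lower(value):
    out = []
    pos = 0
    for m in PROTECTED.finditer(value):
        out.append(i18n_lower(value[pos:m.start()]))  # text before the match
        out.append(m.group(0))  # protected span, kept verbatim
        pos = m.end()
    out.append(i18n_lower(value[pos:]))  # text after the last match
    return "".join(out)


sample = "Hello [spec KeEp] WORLD [spec-complex\n\tTab  AND\n] Visit https://Example.com/Path NOW"
print(s_lower(sample))
```

Because m.group(0) is always the full match, the number of groups inside the individual patterns no longer matters, and the whitespace inside [spec-complex ...] is copied through untouched.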