How can I convert `A_B_C_DEF` to `ABC_DEF`?

I have strings of this form:

A_B_CDEF_GHI
A_B_C_DEF_G_H_I
ABC_D_E_F_GHI
ABCDEFG_H_I
A_B_C

I need to convert those to the following:

AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC

So the rules are:

(._){2,} should be converted to XXX_ if it's not at the end of the string.
If (_.){2,} occurs at the end of a string, it should be converted to _XXX.
If (._){2,}. is the entire string, all underscores should be removed.

I've gotten to (((.)_){2,}), which does match the first rule, but how can I replace it with the non-underscore characters it found?

The python tag is present because that's where the code is, and I know regex dialects depend on the language.

CodePudding user response：

The dot in your example code matches any character including an underscore. You can make the pattern a bit more specific instead.
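To make that concrete, here is a minimal sketch comparing the dot-based pattern with a letter-only one. The names `loose` and `strict` are made up for this illustration, and `[A-Z]` assumes the input only contains uppercase ASCII letters and underscores, as in the question's samples.

```python
import re

# Hypothetical names for illustration only.
loose = re.compile(r"(._){2,}")        # "." also matches "_"
strict = re.compile(r"([A-Z]_){2,}")   # only an uppercase letter before each "_"

# A string made only of underscores satisfies the dot-based pattern...
print(bool(loose.fullmatch("____")))   # True
# ...but not the letter-based one.
print(bool(strict.fullmatch("____")))  # False
# Both accept the intended shape.
print(bool(strict.fullmatch("A_B_")))  # True
```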

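You can first match every run of two or more A-Z characters, so that those are consumed without being changed, and capture in a group each single A-Z that is followed by one or more repetitions of _ and another single A-Z.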

Then for the capture group replace the _ with an empty string.
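The replacement has to be a function rather than a backreference, because in Python's `re` a group inside a repetition only keeps the text of its last iteration. A small sketch using the pattern from the question shows this; the helper name `drop_underscores` and the simplified pattern in the second half are assumptions made for this illustration.

```python
import re

# The pattern from the question: group 3 sits inside a repeated group.
m = re.search(r"(((.)_){2,})", "A_B_CDEF_GHI")
print(m.group(1))  # A_B_  (the whole repeated run)
print(m.group(3))  # B     (only the last iteration is kept, "A" is lost)

# A callable replacement sees the full match and can edit it freely.
def drop_underscores(match):
    return match.group().replace("_", "")

print(re.sub(r"[A-Z](?:_[A-Z])+", drop_underscores, "A_B_C"))  # ABC
```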

_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)

_?[A-Z]{2,}_? Match 2 or more occurrences of A-Z surrounded by optional underscores
| or
( Capture group 1
- [A-Z] Match a single A-Z
- (?:_[A-Z](?![A-Z]))+ Repeat 1+ times _ and A-Z asserting not A-Z to the right
) Close group 1
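The same breakdown can be kept next to the pattern by compiling it with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the pattern. This is only a sketch: the names `PATTERN` and `squeeze` are invented here, and the pattern assumes the single-letter chain is quantified with `+` as described above.

```python
import re

PATTERN = re.compile(
    r"""
    _?[A-Z]{2,}_?            # run of 2+ letters, optional underscores around it
    |                        # or
    (                        # group 1: a chain of single letters
      [A-Z]                  #   one letter
      (?:_[A-Z](?![A-Z]))+   #   1+ times: underscore, then a letter not followed by a letter
    )
    """,
    re.VERBOSE,
)

def squeeze(match):
    # Only chains captured in group 1 lose their underscores;
    # longer runs are returned exactly as they were matched.
    chain = match.group(1)
    return chain.replace("_", "") if chain else match.group()

print(PATTERN.sub(squeeze, "A_B_C_DEF_G_H_I"))  # ABC_DEF_GHI
```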

For example:

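import re

# Group 1 captures chains of single letters; runs of 2+ letters are matched and left as-is.
pattern = r'_?[A-Z]{2,}_?|([A-Z](?:_[A-Z](?![A-Z]))+)'
s = ("A_B_CDEF_GHI\n"
     "A_B_C_DEF_G_H_I\n"
     "ABC_D_E_F_GHI\n"
     "ABCDEFG_H_I\n"
     "A_B_C")

# Remove the underscores only when group 1 took part in the match.
res = re.sub(pattern, lambda x: x.group(1).replace("_", "") if x.group(1) else x.group(), s)
print(res)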

Output

AB_CDEF_GHI
ABC_DEF_GHI
ABC_DEF_GHI
ABCDEFG_HI
ABC

If the strings can contain characters other than A-Z, a somewhat broader variant uses a negated character class that matches any character except whitespace or an underscore:

_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)
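As a rough check of the broader pattern, the sketch below applies it to lowercase letters and digits. The names `BROAD` and `merge_singles` and the sample strings are made up for this illustration, and the pattern assumes the repeated part is quantified with `+`.

```python
import re

# Hypothetical name; same structure as the A-Z pattern, with [^_\s] in place of [A-Z].
BROAD = r"_?[^_\s]{2,}_?|([^_\s](?:_[^_\s](?![^_\s]))+)"

def merge_singles(text: str) -> str:
    # Strip underscores only inside chains of single characters (group 1).
    return re.sub(
        BROAD,
        lambda m: m.group(1).replace("_", "") if m.group(1) else m.group(),
        text,
    )

print(merge_singles("a_b_c_def_1_2"))  # abc_def_12
print(merge_singles("x_y z_9_ab"))     # xy z9_ab
```

Whitespace is never matched, so several space- or newline-separated strings can be processed in one call.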

CodePudding user response：

Here's a solution without regular expressions:

from typing import Iterator


def convert(s: str) -> str:
    """ https://stackoverflow.com/q/71578300 """

    def _get_combined_parts() -> Iterator[str]:
        """
        yields the ``_``-separated parts of ``s``
        where subsequent single-character parts have been combined
        """
        combined_part = ""
        for part in s.split("_"):
            if len(part) <= 1:
                combined_part += part
            else:
                if combined_part:
                    yield combined_part
                yield part
                combined_part = ""
        if combined_part:
            yield combined_part

    return "_".join(_get_combined_parts())
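
The same split-and-merge idea can also be sketched with `itertools.groupby`, which groups consecutive parts by whether they are at most one character long. This is an alternative formulation, not the code above; the name `convert_groupby` is hypothetical, and the checks use the sample strings from the question.

```python
from itertools import groupby


def convert_groupby(s: str) -> str:
    merged = []
    # Consecutive parts share a group while they agree on "is at most one character".
    for is_short, group in groupby(s.split("_"), key=lambda part: len(part) <= 1):
        if is_short:
            joined = "".join(group)
            if joined:  # skip groups that consist only of empty parts
                merged.append(joined)
        else:
            merged.extend(group)
    return "_".join(merged)


samples = {
    "A_B_CDEF_GHI": "AB_CDEF_GHI",
    "A_B_C_DEF_G_H_I": "ABC_DEF_GHI",
    "ABC_D_E_F_GHI": "ABC_DEF_GHI",
    "ABCDEFG_H_I": "ABCDEFG_HI",
    "A_B_C": "ABC",
}
for source, expected in samples.items():
    print(source, "->", convert_groupby(source), convert_groupby(source) == expected)
```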